# `pandas` Part 5: Finding and Replacing Values

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Check datatypes with `dtype`
2. Find and replace missing (null) values with `fillna()`
 

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. import pandas as pd
2. Create or load data into a pandas DataFrame or Series
3. Reading data with `pd.read_`
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first 5 records in your DataFrame (or `head(10)` to show the first 10)
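
The general steps above can be sketched on a tiny example. This is only an illustration: the three-row CSV text and the name `sample` are made up, and `io.StringIO` stands in for a real file on disk so the cell runs without any download.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk.
csv_text = """country,points,price
Italy,87,
Portugal,87,15.0
US,87,14.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.head())  # first 5 rows by default
```

With a real file you would pass the file name (or full path) as a string instead of the `StringIO` object.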

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies are at this level but not much more than this
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level but are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

In [1]:
import pandas as pd
import numpy as np


In [2]:
ls

 Volume in drive C is Windows
 Volume Serial Number is 8650-7A23

 Directory of C:\Users\lukasz\Desktop\BAIM 3220 Python

05/08/2021  08:20 AM    <DIR>          .
05/08/2021  08:20 AM    <DIR>          ..
05/08/2021  08:20 AM    <DIR>          .ipynb_checkpoints
04/17/2021  10:07 AM        15,349,323 complete.csv
04/20/2021  09:28 AM               485 customer.csv
04/21/2021  05:16 PM            49,774 foodplot4.png
03/30/2021  09:57 AM             3,837 Future50.csv
04/20/2021  09:28 AM               206 invoice.csv
04/20/2021  09:28 AM               496 line.csv
04/13/2021  09:31 AM            38,013 loans.csv
04/19/2021  03:39 PM                 0 'merged.ipynb'
04/19/2021  03:46 PM                 0 'merged5.ipynb'
04/17/2021  10:46 AM                 0 -o
04/29/2021  01:04 PM           154,383 Pandas_FindandReplace_TypeAlong_.ipynb
04/19/2021  12:31 PM           157,794 Pandas_FindandReplace_TypeAlongCOMPLETED.ipynb
04/29/2021  01:04 PM           192,151 Pandas_Part4_Grouping_Type

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- drop the unnamed column 

In [3]:
wine=pd.read_csv('winemag-data-130k-v2.csv')
data=wine.drop(["Unnamed: 0"], axis=1)
data

FileNotFoundError: [Errno 2] No such file or directory: 'winemag-data-130k-v2.csv'

### Check how many rows, columns, and data points are in the `data` DataFrame
>- Use `shape` and indexing (`shape[0]` for rows, `shape[1]` for columns) to define variables
>- We can store the values for rows and columns in variables if we want to access them later

In [4]:
data.shape


NameError: name 'data' is not defined
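
A minimal sketch of storing the row and column counts in variables. The DataFrame `sample` and the variable names `n_rows`, `n_cols`, and `n_points` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the wine data.
sample = pd.DataFrame({"country": ["Italy", "US", "US"], "points": [87, 90, 85]})

n_rows, n_cols = sample.shape   # shape is a (rows, columns) tuple
n_points = sample.size          # total number of cells: rows * columns

print(n_rows, n_cols, n_points)
```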

### Check a couple of rows of data

In [5]:
data.head()

NameError: name 'data' is not defined

### Another step in understanding the data you are working with is checking the data types
>- The analysis will differ depending on the data type
>>- For example, only number fields can be averaged
>>- Text/string analysis usually involves counts/frequencies 

### Checking datatypes with `dtype` and `dtypes`
>- General syntax for `dtype`: `dataFrame.field.dtype`
>>- Returns the datatype for one field
>- General syntax for `dtypes`: `dataFrame.dtypes`
>>- Returns the datatypes for all the fields in a DataFrame
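
As a quick illustration of the difference, here is a sketch on a made-up two-row DataFrame (`sample` is hypothetical, not the wine data). Note that a numeric column containing a missing value is stored as a float.

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column with a gap.
sample = pd.DataFrame({
    "country": ["Italy", "US"],
    "points": [87, 90],
    "price": [15.0, None],
})

price_type = sample["price"].dtype   # datatype of a single field
all_types = sample.dtypes            # Series of datatypes, one entry per field

print(price_type)
print(all_types)
```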

###  Check one field with `dtype`

In [6]:
data['price'].dtype

NameError: name 'data' is not defined

### Check all the fields in the data frame with `dtypes`

In [7]:
data.dtypes


NameError: name 'data' is not defined

### Question: What is the average price of all wines? 

In [8]:
data['price'].mean()

NameError: name 'data' is not defined

### Question: How many wines are there per country in the data frame? 

In [9]:
countries=data.groupby('country') # make our dataframe grouped by country
uniques=countries.agg({'title': "nunique"}) #aggregation function 
uniques

NameError: name 'data' is not defined

In [10]:
(data.groupby('country'))['title'].nunique()

NameError: name 'data' is not defined

##### Another way to get wines by country using `groupby`: 

In [11]:
countryGroup=data.groupby('country')
countryGroup['title'].nunique()

NameError: name 'data' is not defined
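
One thing to keep in mind with the approaches above: `nunique()` counts distinct titles, so a wine that is reviewed twice is counted once. A small sketch on made-up data (the DataFrame `sample` and its values are hypothetical) shows how that differs from `count()`:

```python
import pandas as pd

# Hypothetical reviews: title "A" appears twice for Italy.
sample = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "title": ["A", "A", "B", "C", "D"],
})

by_country = sample.groupby("country")["title"]

unique_titles = by_country.nunique()  # distinct titles per country
row_counts = by_country.count()       # non-null title rows per country

print(unique_titles)
print(row_counts)
```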

## What are the descriptive analytics for wine price?
>- Include the 10th and 90th percentiles of wines in the analysis
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [12]:
data['price'].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])

NameError: name 'data' is not defined
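
The `percentiles` parameter of `describe()` controls which percentiles are reported. Here is a sketch on a made-up Series of five prices (the values are hypothetical); the median (50%) is always included even if it is not listed.

```python
import pandas as pd

# Hypothetical prices, evenly spaced so the percentiles are easy to check by hand.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="price")

# Ask for the 10th and 90th percentiles instead of the default 25th and 75th.
summary = prices.describe(percentiles=[0.1, 0.9])

print(summary)
```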

## What are the descriptive analytics for country?  

In [13]:
data['country'].describe()

NameError: name 'data' is not defined

## What are the descriptive analytics for all numerical fields in the data frame? 
>- Note: By default, `describe()` summarizes only the numerical fields when called on a DataFrame. 
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [14]:
data.describe()

NameError: name 'data' is not defined

#### Question: Why would points and price have different count values? 

## What are the descriptive analytics for all non-numeric fields in the DataFrame? 
>- Note: `describe()` takes an `include` parameter; passing `include='object'` (or the shorthand `include='O'`) summarizes only the string fields.
>>- To get the string columns themselves, use `select_dtypes(include='object')`
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes


In [15]:
data.describe(include="O")

NameError: name 'data' is not defined
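
To see how `select_dtypes` and `describe(include=...)` relate, here is a sketch on made-up data. The DataFrame `sample` is hypothetical, and the text columns are created with `dtype="object"` explicitly so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data with two text columns and one numeric column.
sample = pd.DataFrame({
    "country": pd.Series(["Italy", "US", "US"], dtype="object"),
    "variety": pd.Series(["Red Blend", "Pinot Noir", "Pinot Noir"], dtype="object"),
    "points": [87, 90, 85],
})

# select_dtypes returns a DataFrame containing only the matching columns.
text_only = sample.select_dtypes(include="object")

# describe(include="object") summarizes those columns: count, unique, top, freq.
text_summary = sample.describe(include="object")

print(text_only.columns.tolist())
print(text_summary)
```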

## Finally, to include every field in the data frame:
>- use `describe(include='all')`

In [16]:
data.describe(include='all')

NameError: name 'data' is not defined

# Notice how the fields in `data` vary in count? 
>- A common occurrence in datasets is missing (aka null) values
>- We can use `isnull()` (or `pd.isnull()`) to find the null values in a particular field
>- We can use `notnull()` (or `pd.notnull()`) to keep only the non-missing values in a particular field
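
A minimal sketch of both checks on made-up data (the DataFrame `sample` and its titles are hypothetical). Each call returns a Series of True/False values that can be used to filter rows.

```python
import pandas as pd

# Hypothetical data: the second wine has no country.
sample = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "title": ["Wine A", "Wine B", "Wine C"],
})

missing_mask = sample["country"].isnull()    # True where country is missing
present_mask = sample["country"].notnull()   # True where country is present

missing_rows = sample[missing_mask]
present_rows = sample[present_mask]

print(missing_rows)
print(present_rows)
```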

#### Q: What are all the wines with missing country values?

In [17]:
nullCountries=data[data['country'].isnull()]
nullCountries

NameError: name 'data' is not defined

In [18]:
nullCountries['title'].tolist()


NameError: name 'nullCountries' is not defined

## Now, suppose we want to replace a missing value with `Unknown`
>- We can use a pandas function called `fillna()` and pass the value "Unknown" to it

#### Replace null values for `region_2` with 'Unknown'

In [19]:
data['region_2']=data['region_2'].fillna(value="Unknown")
#overwrite the region_2 column with a copy in which every null value is replaced by "Unknown"
data

NameError: name 'data' is not defined
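
`fillna()` can also take a dictionary when different columns need different fill values. The sketch below uses made-up data; the DataFrame `sample`, its values, and the fill strings are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with gaps in two columns.
sample = pd.DataFrame({
    "region_1": ["Napa", None, "Etna"],
    "region_2": [None, "Sonoma", None],
})

# Fill one column and assign the result back to it.
sample["region_2"] = sample["region_2"].fillna("Unknown")

# Fill several columns at once: keys are column names, values are the fill values.
filled = sample.fillna({"region_1": "No region"})

print(filled)
```

`fillna()` returns a new object, so the result has to be assigned back (or stored in a new variable) for the change to stick.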

# Using `replace()` to replace specific values
>- Suppose a taster in the dataset gets a new Twitter handle
>>- We can use `replace()` to update this data

#### Task: Kerin O'Keefe is changing her Twitter handle from `@kerinokeefe` to `@kerino`
>- Use pandas `replace()` to make the change in our DataFrame

In [20]:
data=data.replace({"@kerinokeefe":'@kerino'}) #uses dictionary-style mapping: {old value: new value}

#replace() swaps every cell that is exactly '@kerinokeefe', in any column, for '@kerino'
data

NameError: name 'data' is not defined
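
If the change should only touch one field, `replace()` can be called on that column instead of the whole DataFrame. This is a sketch on made-up data: the DataFrame `sample` and the column name `taster_twitter_handle` are assumptions, so check `data.columns` for the real name. Without `regex=True`, `replace()` matches whole cell values, not pieces of text inside a longer string.

```python
import pandas as pd

# Hypothetical data; the column names are assumed for illustration.
sample = pd.DataFrame({
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger"],
    "description": ["Follow @kerinokeefe for more", "Ripe and fruity"],
})

# Replace only within the handle column.
sample["taster_twitter_handle"] = sample["taster_twitter_handle"].replace(
    {"@kerinokeefe": "@kerino"}
)

print(sample)
```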

# Section 2
## Dealing with missing data
>- Missing data is one of the most pervasive problems in data analysis
>- No matter what field you work in you will likely come across datasets that contain incomplete data for some records
>>- Missing data can occur because experimental units die (e.g., rats in a clinical study), equipment malfunctions, survey respondents skip questions, or the person recording the data simply makes a mistake. 
>- The seriousness of the missing data depends on the amount of missing data, the pattern of missing data, and why it is missing
>>- Why the data is missing and the pattern of missingness are more important than the amount of missing data. However, missing data will have a larger impact on small datasets than on large ones 

>- This section focuses on some common strategies for handling missing data

## Common strategies for dealing with missing data

Tabachnick & Fidell (2019) give us several commonly used methods for handling missing data values. 

1. Remove any records that contain missing data 
>- If only a few records/cases have missing data and they seem to be a random subsample of the whole sample, deletion can be a good method of dealing with missing data
2. Estimating missing data
>- A second option is to estimate (impute) missing values and then use the estimates during analysis. Here are some common estimation methods
>>- Use prior knowledge to estimate the value. Here, the analyst/researcher replaces missing values with an educated guess based on expertise in the area. 
>>- Mean replacement. Calculate the overall mean of the feature and impute that for all missing values. In the absence of any other information, the mean is the best guess about the value of a feature/variable
>>- Median replacement. Calculate the median of the feature and impute that for all missing values
>>- Regression replacement. A more sophisticated approach would be to use a regression model and impute missing values based on the values of other features that we do have data on

#### Regardless of what method is used for missing data, it is recommended to:
1. Create a new feature that stores information on whether or not missing data was imputed
>- This is a binary column (usually 0's and 1's) indicating if missing data was imputed for a record or not
2. Repeat the analysis with and without missing data and imputation methods and determine if conclusions are the same under each circumstance

#### Reference: Tabachnick & Fidell (2019). *Using Multivariate Statistics*. Pearson.
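
The deletion, mean, and median strategies, along with the recommended flag column, can be sketched on a made-up price column. This is an illustration under assumptions, not the only way to do it: the DataFrame `sample`, its values, and the column names `price_mean` and `price_median` are hypothetical. The flag is created before any filling so that it records exactly which rows were missing.

```python
import pandas as pd

# Hypothetical prices with two missing values.
sample = pd.DataFrame({"price": [10.0, None, 30.0, None, 80.0]})

# Flag the missing rows BEFORE filling: 1 = value will be imputed, 0 = original value.
sample["imputeFlag"] = sample["price"].isnull().astype(int)

# mean() and median() skip missing values by default.
mean_price = round(sample["price"].mean(), 2)
median_price = sample["price"].median()

# Keep the original column and store each imputed version separately for comparison.
sample["price_mean"] = sample["price"].fillna(mean_price)
sample["price_median"] = sample["price"].fillna(median_price)

# Strategy 1 from the list: drop the records with a missing price instead.
dropped = sample.dropna(subset=["price"])

print(sample)
print(len(dropped))
```

Keeping the original column next to the imputed ones makes it easy to repeat the analysis with and without imputation, as recommended above.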

## Practice imputing mean values for missing data

### Q7:  Calculate the mean price and store the mean in a variable, `meanPrice`
>- Round to two decimal places

In [21]:
meanPrice=round(data['price'].mean(),2)
meanPrice

NameError: name 'data' is not defined

In [22]:
data['price']=data['price'].fillna(meanPrice)
#fill every missing price with the value stored in meanPrice
data

NameError: name 'data' is not defined

### Q7: Create a column, `imputeFlag`, that stores a 1 if the record's price was filled in with `meanPrice` and a 0 if it was not

In [23]:
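data['imputeFlag']= [1 if x==meanPrice else 0 for x in data['price']]
#compare against meanPrice instead of a hard-coded 35.36; note this also flags any wine whose real price happens to equal meanPrice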



NameError: name 'data' is not defined

In [24]:
data

NameError: name 'data' is not defined

### We can do similar things with the `map()` function
>- `map()` is used to substitute each value in a Series with another value
>- General syntax: `Series.map(arg,na_action=None)`
>>- Where *arg* can be a function, a dictionary, or a Series

Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html

>- For example, `data['price'].map(lambda x: 0 if x > 0 else 1)` passes a lambda function as the *arg* to *map()*, turning the `price` column into a Series of 0's and 1's: 0 where the price is greater than 0, and 1 otherwise (including missing prices, because comparing a missing value with 0 returns False)
>- You could also define your own function and then pass that function into *map()* 
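
Here is a sketch of the other two kinds of *arg*: a named function and a dictionary. Everything in it is made up for illustration, including the function `price_band`, the cutoff of 50, and the country-to-continent lookup. With a dictionary, any value that is not a key becomes a missing value.

```python
import pandas as pd

prices = pd.Series([15.0, None, 120.0], name="price")

def price_band(price):
    """Label a single price; the cutoff of 50 is an arbitrary example."""
    if pd.isnull(price):
        return "unknown"
    return "premium" if price >= 50 else "everyday"

# Pass the function itself (no parentheses) to map().
bands = prices.map(price_band)

# Pass a dictionary to map() to look each value up.
countries = pd.Series(["US", "Italy", "Chile"])
continents = countries.map({"US": "North America", "Italy": "Europe"})

print(bands)
print(continents)
```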

### Q8: Replace null values for `designation` with 'NO DESIGNATION'

In [25]:
data['designation']=data['designation'].map(lambda row: "NO DESIGNATION" if pd.isnull(row) else row)
#replace every null designation with 'NO DESIGNATION' and leave all other values unchanged

NameError: name 'data' is not defined

#### Show the first five records of `data` with all of your changes

In [26]:
data.head(5)

NameError: name 'data' is not defined

#### Show all the column names in `data`

In [27]:
data.columns

NameError: name 'data' is not defined

#### Show the total null values in each column in `data`

In [28]:
data.isnull().sum(axis=0)

NameError: name 'data' is not defined
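
A small sketch of the same check on made-up data (the DataFrame `sample` is hypothetical). Taking the mean of the True/False values instead of the sum gives the share of missing values per column, which can be easier to compare across columns.

```python
import pandas as pd

# Hypothetical data with gaps in two of the three columns.
sample = pd.DataFrame({
    "country": ["Italy", None, "US", "US"],
    "price": [None, 15.0, None, 20.0],
    "points": [87, 90, 85, 88],
})

null_counts = sample.isnull().sum()    # number of missing values per column
null_share = sample.isnull().mean()    # fraction of missing values per column

print(null_counts)
print(null_share)
```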