## Imputing Values

This notebook continues from `Removing_Null.ipynb`.


### Imputation Methods and Resources

One of the most common ways to handle missing values is imputation: filling in a substitute value wherever the original value was missing.

It is very common to impute in the following ways:
1. Impute the **mean** of a column.<br>
2. If you are working with categorical data or a variable with outliers, then use the **mode** of the column.<br>
3. Impute 0, a very small number, or a very large number to differentiate missing values from other values.<br>
4. Use k-nearest neighbors (KNN) to impute values from the rows that are most similar on the other features.<br>
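
As a rough sketch of these four strategies, the snippet below applies each one to a small made-up frame. The `toy` data, its column names, the `-1` flag value, and `n_neighbors=2` are illustrative assumptions, not part of the survey data. It uses plain pandas for the first three and scikit-learn's `KNNImputer` for the last.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'age':    [25.0, np.nan, 40.0, 35.0, np.nan],
    'income': [50.0, 60.0, np.nan, 80.0, 55.0],
    'city':   ['A', 'B', np.nan, 'A', 'A'],
})

# 1. Mean imputation for a numeric column
age_mean = toy['age'].fillna(toy['age'].mean())

# 2. Mode imputation for a categorical column
#    (mode() returns a Series, so take its first entry)
city_mode = toy['city'].fillna(toy['city'].mode()[0])

# 3. Constant imputation: -1 is an arbitrary flag that marks "was missing"
income_flag = toy['income'].fillna(-1)

# 4. KNN imputation on the numeric columns only
knn = KNNImputer(n_neighbors=2)
numeric_knn = pd.DataFrame(
    knn.fit_transform(toy[['age', 'income']]),
    columns=['age', 'income'],
)
```

Which strategy is reasonable depends on the column type and on why the values are missing, so treat this as a menu rather than a recipe.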

In general, before simply dropping or imputing missing values, try to understand why the values are missing and what that implies in the real world.
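
One way to start that investigation is to check how much of each column is missing and whether the gaps in one column track another variable. The sketch below does this on made-up data; the `survey` frame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salary is missing only for respondents who are not employed
survey = pd.DataFrame({
    'employed': ['yes', 'yes', 'no', 'no', 'yes', 'no'],
    'salary':   [50000.0, 62000.0, np.nan, np.nan, 58000.0, np.nan],
})

# Share of missing values in each column
missing_share = survey.isnull().mean()

# Share of missing salaries within each employment group
missing_by_group = survey['salary'].isnull().groupby(survey['employed']).mean()

print(missing_share)
print(missing_by_group)
```

In this toy case every missing salary belongs to a respondent who is not employed, so the gaps carry information and filling them with the mean salary would be misleading.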

**[References]**
- Chris Albon's content is again very helpful for many of these items, and you can access it [here](https://chrisalbon.com/).  
- He uses the [sklearn.preprocessing module](http://scikit-learn.org/stable/modules/preprocessing.html).  
- There are also many ways to fill in missing values directly with pandas, which can be found [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).

**[Go to the tutorial](#tutorial)** 

In [13]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv('../dataset/survey_results/survey-results-public.csv')
df.head()

|   | Respondent | Professional | ProgramHobby | Country | University | EmploymentStatus | FormalEducation | MajorUndergrad | HomeRemote | CompanySize | ... | StackOverflowMakeMoney | Gender | HighestEducationParents | Race | SurveyLong | QuestionsInteresting | QuestionsConfusing | InterestedAnswers | Salary | ExpectedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Student | Yes, both | United States | No | Not employed, and not looking for work | Secondary school | NaN | NaN | NaN | ... | Strongly disagree | Male | High school | White or of European descent | Strongly disagree | Strongly agree | Disagree | Strongly agree | NaN | NaN |
| 1 | 2 | Student | Yes, both | United Kingdom | Yes, full-time | Employed part-time | Some college/university study without earning ... | Computer science or software engineering | More than half, but not all, the time | 20 to 99 employees | ... | Strongly disagree | Male | A master's degree | White or of European descent | Somewhat agree | Somewhat agree | Disagree | Strongly agree | NaN | 37500.0 |
| 2 | 3 | Professional developer | Yes, both | United Kingdom | No | Employed full-time | Bachelor's degree | Computer science or software engineering | Less than half the time, but at least one day ... | 10,000 or more employees | ... | Disagree | Male | A professional degree | White or of European descent | Somewhat agree | Agree | Disagree | Agree | 113750.0 | NaN |
| 3 | 4 | Professional non-developer who sometimes write... | Yes, both | United States | No | Employed full-time | Doctoral degree | A non-computer-focused engineering discipline | Less than half the time, but at least one day ... | 10,000 or more employees | ... | Disagree | Male | A doctoral degree | White or of European descent | Agree | Agree | Somewhat agree | Strongly agree | NaN | NaN |
| 4 | 5 | Professional developer | Yes, I program as a hobby | Switzerland | No | Employed full-time | Master's degree | Computer science or software engineering | Never | 10 to 19 employees | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### `1.` Model after dropping missing values 

In [14]:
#Only use quant variables and drop any rows with missing values

num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
df_dropna = num_vars.dropna(axis=0)

#Split into explanatory and response variables
X = df_dropna[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y = df_dropna['Salary']

#Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42) 

lm_model = LinearRegression() # Instantiate (the normalize argument was removed in scikit-learn 1.2)
lm_model.fit(X_train, y_train) #Fit
        
#Predict and score the model
y_test_preds = lm_model.predict(X_test) 
"The r-squared score for your model was {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test))

'The r-squared score for your model was 0.030994664959115625 on 1602 values.'

### `2.` Model after imputing missing values
Impute each feature's missing values with that feature's mean.

In [25]:
#Only use quant variables
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]

#Dropping rows with no salary information
df_sal_true = num_vars.dropna(subset=['Salary'])
df_sal_true.Salary.isnull().sum()

0

In [26]:
#Imputing with mean of each feature
fill_mean = lambda col : col.fillna(col.mean())
fill_df = df_sal_true.apply(fill_mean, axis=0)

fill_df.isnull().sum()

Salary                       0
CareerSatisfaction           0
HoursPerWeek                 0
JobSatisfaction              0
StackOverflowSatisfaction    0
dtype: int64

In [35]:
X = fill_df[fill_df.columns.difference(['Salary'])]
y = fill_df[['Salary']]

# Split into train, test set : test_size=0.3, random_state=42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fitting Linear Regression model
lm_model_2 = LinearRegression()  # the normalize argument was removed in scikit-learn 1.2
lm_model_2.fit(X_train, y_train)

# Predict value and score the result
y_test_preds = lm_model_2.predict(X_test)
r2_score_lm2 = r2_score(y_test, y_test_preds)

"The r-squared score for your model was {} on {} values.".format(r2_score_lm2, len(y_test))

'The r-squared score for your model was 0.04072431792894726 on 3868 values.'

The score for the model with imputation is still far from perfect, but it improved slightly (this will not always happen), and the model was evaluated on far more test data: 3868 rows instead of 1602, roughly 2.4 times as many (a 141% increase) as when simply dropping missing values.
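
For reference, the relative size of the two test sets follows directly from the counts reported in the two outputs above (1602 and 3868). The variable names below are chosen only for this illustration.

```python
n_dropped = 1602   # test rows when every row with a missing value is dropped
n_imputed = 3868   # test rows when missing features are imputed with the mean

ratio = n_imputed / n_dropped
pct_increase = (n_imputed - n_dropped) / n_dropped * 100

print(f"{ratio:.2f}x as many rows, a {pct_increase:.0f}% increase")
# prints "2.41x as many rows, a 141% increase"
```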

---
## Tutorial - imputing values 
<a id="tutorial"></a>

In [5]:
df = pd.DataFrame({'A':[np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B':[3, 4, 5, 1, 2, 3, 5],
                   'C':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D':[np.nan, True, np.nan, False, True, False, np.nan],
                   'E':['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

df

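|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | NaN | 3 | NaN | NaN | Yes |
| 1 | 2.0 | 4 | NaN | True | No |
| 2 | NaN | 5 | NaN | NaN | Maybe |
| 3 | 0.0 | 1 | NaN | False | NaN |
| 4 | 7.0 | 2 | NaN | True | NaN |
| 5 | 10.0 | 3 | NaN | False | Yes |
| 6 | 15.0 | 5 | NaN | NaN | NaN |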


#### Question 1

**1.** Find the appropriate data type for each column.

A : quantitative, B : quantitative, C : hard to tell (every value is missing), D : boolean, E : categorical
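
The dtypes that pandas infers are a useful cross-check here, although they describe how a column is stored rather than what it means. The sketch below rebuilds the tutorial dataframe under the hypothetical name `tutorial_df` and prints them.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial dataframe above, rebuilt under an illustrative name
tutorial_df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                            'B': [3, 4, 5, 1, 2, 3, 5],
                            'C': [np.nan] * 7,
                            'D': [np.nan, True, np.nan, False, True, False, np.nan],
                            'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Convert dtypes to strings so they are easy to compare and print
dtypes = tutorial_df.dtypes.astype(str)
print(dtypes)
```

A and C come back as `float64` (C only because NaN is a float), B as `int64`, and D as `object` because it mixes booleans with NaN. How E is reported depends on the pandas version, so a dtype alone does not settle whether a column is categorical.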

#### Question 2

**2.** Are there any columns or rows that you feel comfortable dropping in this dataframe?

In [6]:
# Use this cell to drop any columns or rows you feel comfortable dropping based on the above
new_df = df.dropna(how='all', axis=1)
new_df

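|   | A | B | D | E |
|---|---|---|---|---|
| 0 | NaN | 3 | NaN | Yes |
| 1 | 2.0 | 4 | True | No |
| 2 | NaN | 5 | NaN | Maybe |
| 3 | 0.0 | 1 | False | NaN |
| 4 | 7.0 | 2 | True | NaN |
| 5 | 10.0 | 3 | False | Yes |
| 6 | 15.0 | 5 | NaN | NaN |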
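
`how='all'` removes a column only when every entry in it is NaN, whereas the default `how='any'` removes it as soon as one entry is missing. The sketch below contrasts the two on the tutorial data, rebuilt here as `tutorial_df` (a name chosen for illustration).

```python
import numpy as np
import pandas as pd

tutorial_df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                            'B': [3, 4, 5, 1, 2, 3, 5],
                            'C': [np.nan] * 7,
                            'D': [np.nan, True, np.nan, False, True, False, np.nan],
                            'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Drop columns that are entirely NaN: only C goes
cols_all = tutorial_df.dropna(how='all', axis=1).columns.tolist()

# Drop columns with at least one NaN: only B survives
cols_any = tutorial_df.dropna(how='any', axis=1).columns.tolist()

# Drop rows that are entirely NaN: none are, because B is always present
rows_all = tutorial_df.dropna(how='all', axis=0)

print(cols_all, cols_any, len(rows_all))
```

With `how='any'` almost all of the information would be thrown away, which is why only the empty column C is dropped here.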


#### Question 3

**3.** Try imputing missing values with the below function.

In [7]:
fill_mean = lambda col: col.fillna(col.mean())

try:
    new_df.apply(fill_mean, axis=0)
except Exception:
    print('That broke...')

That broke...


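The call fails as a whole because of column E. Looking at each column separately:
- Column A is no problem: the NaN values are filled with the mean, as expected.
- Column D can be filled with the mean, but the mean of a boolean column does not make sense in this case.
- Column E raises an error, because a mean is not defined for strings.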
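
One way to avoid the failure is to apply mean imputation only to the numeric columns and leave the rest untouched. The sketch below rebuilds the dataframe without column C under the hypothetical name `tutorial_df`. Selecting columns with `select_dtypes` is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Tutorial data without the all-NaN column C
tutorial_df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                            'B': [3, 4, 5, 1, 2, 3, 5],
                            'D': [np.nan, True, np.nan, False, True, False, np.nan],
                            'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

fill_mean = lambda col: col.fillna(col.mean())

# Pick out the numeric columns (A and B here) and fill only those
numeric_cols = tutorial_df.select_dtypes(include='number').columns
filled = tutorial_df.copy()
filled[numeric_cols] = filled[numeric_cols].apply(fill_mean, axis=0)

print(filled)
```

Column A now has its NaN values replaced by the column mean (34 / 5 = 6.8), while D and E keep their missing values for a more suitable strategy.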

#### Question 4

**4.** Given the results above, it might make more sense to fill some columns with the mode.  Write a function to fill a column with the mode value, and use it on the two columns that might benefit from this type of imputation. 

In [8]:
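# Note: col.mode() returns a Series, so fillna aligns it by index label
# and fills only the rows whose label matches one of the mode entries
fill_mode = lambda col: col.fillna(col.mode())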
new_df.apply(fill_mode, axis=0)

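|   | A | B | D | E |
|---|---|---|---|---|
| 0 | 0.0 | 3 | False | Yes |
| 1 | 2.0 | 4 | True | No |
| 2 | 7.0 | 5 | NaN | Maybe |
| 3 | 0.0 | 1 | False | NaN |
| 4 | 7.0 | 2 | True | NaN |
| 5 | 10.0 | 3 | False | Yes |
| 6 | 15.0 | 5 | NaN | NaN |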
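
Several NaN values remain in the output above (rows 2 and 6 of D, rows 3, 4, and 6 of E). The reason is that `col.mode()` returns a Series, and `fillna` with a Series matches values by index label instead of using one value everywhere. A minimal sketch of a variant that fills with a single value, applied only to the two columns the question targets (the names `tutorial_df` and `fill_mode_scalar` are chosen for illustration):

```python
import numpy as np
import pandas as pd

tutorial_df = pd.DataFrame({'D': [np.nan, True, np.nan, False, True, False, np.nan],
                            'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Take the first mode as a scalar so every missing entry gets the same value
fill_mode_scalar = lambda col: col.fillna(col.mode()[0])

filled = tutorial_df[['D', 'E']].apply(fill_mode_scalar, axis=0)
print(filled)
```

When several values tie for most frequent, as True and False do in column D, `mode()` returns all of them and `[0]` simply takes the first. That choice is arbitrary, so a tie is a signal to think again about whether mode imputation suits the column.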
