## Removing Data

scikit-learn estimators raise an error when the input contains null values. Rows with a null target variable y also carry no information for training, so it is reasonable to drop the rows with missing values.

**[Comments from the course notebook]** So, we can fit a model by dropping rows with missing values. This is great in that sklearn doesn't break! However, this means future observations will not obtain a prediction if they have missing values in any of the columns.
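To make that trade-off concrete, here is a minimal sketch on made-up data. The frame `new_obs` and its values are hypothetical, not taken from the survey. It flags which incoming rows are complete, since only those could be passed to a model that was trained on complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical future observations; the values are invented for illustration
new_obs = pd.DataFrame({
    'CareerSatisfaction': [8.0, np.nan, 6.0],
    'HoursPerWeek': [40.0, 35.0, np.nan],
    'JobSatisfaction': [7.0, 9.0, 5.0],
})

# A row can be scored only if every feature is present
can_predict = new_obs.notnull().all(axis=1)

print(can_predict.tolist())  # [True, False, False]
print(can_predict.mean())    # share of rows that would receive a prediction
```

In this toy frame only one of three rows would get a prediction, which is the cost of relying on row removal alone.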

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv('../dataset/survey_results/survey-results-public.csv')

#Subset to only quantitative vars
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 
               'JobSatisfaction', 'StackOverflowSatisfaction']]

num_vars.head()

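|    | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---:|---:|---:|---:|---:|---:|
| 0 | NaN | NaN | 0.0 | NaN | 9.0 |
| 1 | NaN | NaN | NaN | NaN | 8.0 |
| 2 | 113750.0 | 8.0 | NaN | 9.0 | 8.0 |
| 3 | NaN | 6.0 | 5.0 | 3.0 | 10.0 |
| 4 | NaN | 6.0 | NaN | 8.0 | NaN |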


#### Question 1

**1.** What proportion of individuals in the dataset reported a salary?

In [2]:
prop_sals = df.Salary.notnull().mean() # Proportion of individuals in the dataset with salary reported

prop_sals

0.25083670610211706

Only 25.1% of the rows contain Salary information, so about 74.9% of the rows are missing it, which is a lot.
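The proportion above relies on the fact that the mean of a boolean mask is the share of `True` values. A small sketch with an invented `salary` series (not the survey data) shows both the reported and the missing share:

```python
import numpy as np
import pandas as pd

# Toy salary column (invented values): 1 of 4 entries is reported
salary = pd.Series([np.nan, 50000.0, np.nan, np.nan])

reported = salary.notnull()            # boolean mask: True where a value exists
prop_reported = reported.mean()        # True counts as 1, False as 0
prop_missing = salary.isnull().mean()  # complement of prop_reported

print(prop_reported, prop_missing)     # 0.25 0.75
```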

#### Question 2  <a id="q2"></a>

**2.** Remove the rows associated with nan values in Salary (only Salary) from the dataframe **num_vars**.  Store the dataframe with these rows removed in **sal_rm**.

In [3]:
sal_rm = num_vars.dropna(subset=['Salary']) # dataframe with rows for nan Salaries removed

sal_rm.head()

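|    | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---:|---:|---:|---:|---:|---:|
| 2 | 113750.0 | 8.0 | NaN | 9.0 | 8.0 |
| 14 | 100000.0 | 8.0 | NaN | 8.0 | 8.0 |
| 17 | 130000.0 | 9.0 | NaN | 8.0 | 8.0 |
| 18 | 82500.0 | 5.0 | NaN | 3.0 | NaN |
| 22 | 100764.0 | 8.0 | NaN | 9.0 | 8.0 |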


A short tutorial on removing missing values, taken from the course notebook, is placed after Question 6.

[Go to the tutorial](#tutorial)

#### Question 3

**3.** Using **sal_rm**, create **X** as a dataframe (matrix) of all of the numeric feature variables.  Then, let **y** be the response vector you would like to predict (Salary).  Run the cell below once you have split the data, and use the result of the code to assign the correct letter to **question3_solution**.

In [4]:
X = sal_rm[sal_rm.columns.difference(['Salary'])] #Create X using explanatory variables from sal_rm
y = sal_rm[['Salary']] #Create y using the response variable of Salary

# Split data into training and test data, and fit a linear model
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=.30, random_state=42)
lm_model = LinearRegression() # the `normalize` argument was removed in scikit-learn 1.2

# If our model works, it should just fit our model to the data. Otherwise, it will let us know.
try:
    lm_model.fit(X_train, y_train)
except:
    print("Oh no! It doesn't work!!!")


Oh no! It doesn't work!!!


In [5]:
X.isnull().sum()

CareerSatisfaction             30
HoursPerWeek                 7206
JobSatisfaction                39
StackOverflowSatisfaction     874
dtype: int64

The fit failed because the predictor variables in X still contain missing values.
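A bare `except` hides the actual reason for the failure. The sketch below, on a tiny invented matrix (`X_toy` and `y_toy` are hypothetical names), catches the `ValueError` that scikit-learn raises for NaN input and prints its message. The exact wording of the message differs between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny invented design matrix with one missing entry
X_toy = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # Input validation rejects NaN before any fitting happens
    fit_failed = True
    print(type(err).__name__, '-', err)
```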

#### Question 4

**4.** Remove the rows associated with nan values in any column from **num_vars** (this was the removal process used in the screencast).  Store the dataframe with these rows removed in **all_rm**.

In [6]:
all_rm = num_vars.dropna() # default : how='any', axis=0

all_rm.head()

|    | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---:|---:|---:|---:|---:|---:|
| 25 | 175000.0 | 7.0 | 0.0 | 7.0 | 9.0 |
| 34 | 14838.709677 | 10.0 | 1.0 | 8.0 | 10.0 |
| 52 | 15674.203822 | 6.0 | 4.0 | 5.0 | 8.0 |
| 57 | 43010.752688 | 10.0 | 2.0 | 6.0 | 10.0 |
| 70 | 65000.0 | 8.0 | 2.0 | 5.0 | 7.0 |


#### Question 5

**5.** Using **all_rm**, create **X_2** as a dataframe (matrix) of all of the numeric feature variables.  Then, let **y_2** be the response vector you would like to predict (Salary).  Run the cell below once you have split the data, and use the result of the code to assign the correct letter to **question5_solution**.

In [7]:
X_2 = all_rm[all_rm.columns.difference(['Salary'])] #Create X using explanatory variables from all_rm
y_2 = all_rm[['Salary']] #Create y using Salary from all_rm

# Split data into training and test data, and fit a linear model
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2 , test_size=.30, random_state=42)
lm_2_model = LinearRegression() # the `normalize` argument was removed in scikit-learn 1.2

# If our model works, it should just fit our model to the data. Otherwise, it will let us know.
try:
    lm_2_model.fit(X_2_train, y_2_train)
except:
    print("Oh no! It doesn't work!!!")

Now it finally works!

#### Question 6

**6.** Now, use **lm_2_model** to predict the **y_2_test** response values, and obtain an r-squared value for how well the predicted values compare to the actual test values.  

In [16]:
y_2_test_preds = lm_2_model.predict(X_2_test)# Predictions here using X_2 and lm_2_model
r2_test =  r2_score(y_2_test, y_2_test_preds)# Rsquared for comparing test and preds from lm_2_model
#RMSE = np.sqrt(mean_squared_error(y_2_test, y_2_test_preds))

# Print r2 to see result
r2_test

0.030994664959115625

**R-squared** is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean. [Retrieved from the linked reference](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)

Following the definition, the model explains about 3.1% (the R-squared score) of the variability of the response data ('Salary'). That is a weak result, but this is only a first attempt!
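For reference, R-squared can be computed by hand as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the true values from their mean. The sketch below uses invented numbers (`y_true` and `y_pred` are hypothetical) rather than the survey predictions.

```python
import numpy as np

# Invented true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 5.0, 6.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors: 2.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean: 20.0
r2 = 1 - ss_res / ss_tot

# Baseline: always predicting the mean gives an R-squared of 0
y_mean_pred = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - y_mean_pred) ** 2) / ss_tot

print(round(r2, 3), r2_baseline)
```

Read this way, a score of about 0.03 means the fitted model does only slightly better than always predicting the mean salary.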

---
<a id="tutorial"></a> 
[Go back to Question 2](#q2)

## Tutorial for missing values
Some useful techniques for the pandas `dropna()` method. I wrote the code myself and verified that it is correct.

Additional resources can be found at [this link](https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/)

In [9]:
small_dataset = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6], 
                              'col2': [7, 8, np.nan, 10, 11, 12],
                              'col3': [np.nan, 14, np.nan, 16, 17, 18]})
small_dataset

|    | col1 | col2 | col3 |
|---:|---:|---:|---:|
| 0 | 1.0 | 7.0 | NaN |
| 1 | 2.0 | 8.0 | 14.0 |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 10.0 | 16.0 |
| 4 | 5.0 | 11.0 | 17.0 |
| 5 | 6.0 | 12.0 | 18.0 |


#### Question 1

**1.** Drop any row with a missing value.

In [10]:
all_drop  = small_dataset.dropna()# Drop any row with a missing value

#print result
all_drop

|    | col1 | col2 | col3 |
|---:|---:|---:|---:|
| 1 | 2.0 | 8.0 | 14.0 |
| 4 | 5.0 | 11.0 | 17.0 |
| 5 | 6.0 | 12.0 | 18.0 |


#### Question 2

**2.** Drop only the row with all missing values.

In [11]:
all_row = small_dataset.dropna(axis=0, how='all')# Drop only rows with all missing values 

#print result
all_row

|    | col1 | col2 | col3 |
|---:|---:|---:|---:|
| 0 | 1.0 | 7.0 | NaN |
| 1 | 2.0 | 8.0 | 14.0 |
| 3 | NaN | 10.0 | 16.0 |
| 4 | 5.0 | 11.0 | 17.0 |
| 5 | 6.0 | 12.0 | 18.0 |


#### Question 3

**3.** Drop only the rows with missing values in column 3.

In [12]:
only3_drop = small_dataset.dropna(axis=0, subset=['col3'])# Drop only rows with missing values in column 3

#print result
only3_drop

|    | col1 | col2 | col3 |
|---:|---:|---:|---:|
| 1 | 2.0 | 8.0 | 14.0 |
| 3 | NaN | 10.0 | 16.0 |
| 4 | 5.0 | 11.0 | 17.0 |
| 5 | 6.0 | 12.0 | 18.0 |


#### Question 4
**4.** Drop only the rows with missing values in column 3 or column 1.

In [13]:
only3or1_drop = small_dataset.dropna(subset=['col1','col3'], how='any') 

#print result
only3or1_drop

|    | col1 | col2 | col3 |
|---:|---:|---:|---:|
| 1 | 2.0 | 8.0 | 14.0 |
| 4 | 5.0 | 11.0 | 17.0 |
| 5 | 6.0 | 12.0 | 18.0 |
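One more `dropna()` option that the questions above do not cover is `thresh`, which keeps a row only if it has at least that many non-missing values. The sketch below rebuilds the same `small_dataset` so it runs on its own; the name `at_least_two` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the tutorial above
small_dataset = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, 5, 6],
                              'col2': [7, 8, np.nan, 10, 11, 12],
                              'col3': [np.nan, 14, np.nan, 16, 17, 18]})

# Keep rows that have at least 2 non-missing values
at_least_two = small_dataset.dropna(thresh=2)

print(list(at_least_two.index))  # [0, 1, 3, 4, 5]
```

Only row 2, which is entirely missing, falls below the threshold here, so the result matches `how='all'` for this particular frame. On wider data `thresh` offers a middle ground between `how='any'` and `how='all'`.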
