# Lesson 11: Linear Regression

We aim to predict the life expenctancy of different countries during different years using socioeconomic features.  

Linear regresssion is **not compatible** with missing values or categorical variables. 

## Imports

In [1]:
# This is to change the top level directory.
%cd ..

/Users/jaimemerizalde/Desktop/Library/Machine Learning Udemy Course


In [49]:
from Library import data
import pandas as pd

from sklearn.model_selection import train_test_split

pd.set_option("display.max_columns", None)

# Get Data

In [50]:
df = data.get_data("Datasets/LifeExpectancy.csv", index_col=[0])
df.shape

(2938, 22)

There are 22 columns / features and 2938 rows of data.

# Preprocessing

Removing missing values and categorical variables. They will not work with a linear model.  

## Missing Values

Check how many missing values.  
This will give us how many missing values, from most to least, there are in this datset.

In [4]:
df.isna().sum().sort_values(ascending=False)

population                         652
hepatitis_b                        553
gdp                                448
total_expenditure                  226
alcohol                            194
income_composition_of_resources    167
schooling                          163
thinness_5_9_years                  34
thinness__1_19_years                34
bmi                                 34
polio                               19
diphtheria                          19
life_expectancy                     10
adult_mortality                     10
hiv_aids                             0
country                              0
year                                 0
measles                              0
percentage_expenditure               0
infant_deaths                        0
status                               0
under_five_deaths                    0
dtype: int64

## Categorial Variables

In [51]:
categorical_columns = list(df.dtypes[df.dtypes == "O"].index.values)
categorical_columns

# Recast non-numerical/object columns as CAT_VARS.
for column in categorical_columns:
    df[column] = df[column].astype("category")

df.dtypes

country                            category
year                                  int64
status                             category
life_expectancy                     float64
adult_mortality                     float64
infant_deaths                         int64
alcohol                             float64
percentage_expenditure              float64
hepatitis_b                         float64
measles                               int64
bmi                                 float64
under_five_deaths                     int64
polio                               float64
total_expenditure                   float64
diphtheria                          float64
hiv_aids                            float64
gdp                                 float64
population                          float64
thinness__1_19_years                float64
thinness_5_9_years                  float64
income_composition_of_resources     float64
schooling                           float64
dtype: object

## Partitioning Dataset

Get features only and responses/targets only.

In [58]:
# Take everything except the response var.
X = df.loc[:, df.columns != "life_expecancy"]

# Now assign response variable only.
y = df["life_expectancy"]

# 67% / 33% SPLIT
# Random state is for seeding a partition. 0 - 42 are common seeds.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state=42)

## Reducing Dataset

Now that we have a train-test-split dataset, we can reduce it by removing missing and non-numerical data.

In [None]:
def remove_missing_and_nonNumerical_values(X, y):
    """
    Removes missing and non-numerical values from dataframe 'X' and corresponding reponse variable y. 
    Prints minor statistics such as original shapes, categorical variables removed, and new sizes.

    Parameters
    ------
    X (pd.DataFrame): Original dataframe. Numerical and non-numerical datatypes must be pre-assigned as 
    numerical type and category.
    y (pd.DataFrame): Original response variable.

    Returns
    ------
    X (pd.DataFrame): Reduced dataframe with non nans or categorical column variables.
    y (pd.DataFrame): Reduced response variable matching corresponding rows of reduced dataframe 'X'.
    """
    print(f"Original Size X: {X.shape} y: {y.shape}")

    # Drop Categorical Values. Nonnumerical Dtypes must be 
    categorical_columns = X.dtypes[X.dtypes=="category"].index.values
    X = X.drop(columns=categorical_columns)
    print("Removed {}".format(categorical_columns))

    # Drop Missing Values.
    X = X.dropna()
    # Make sure you reduce y as well, since the above is a row-reduction technique.
    y = y[X.index]
    print(f"New Size X: {X.shape} y: {y.shape}")
    return X, y

In [60]:
X_train, y_train = remove_missing_and_nonNumerical_values(X_train, y_train)
X_test, y_test = remove_missing_and_nonNumerical_values(X_test, y_test)

Original Size ((1968, 22), (1968,))
Removed ['country' 'status']
New Size ((1123, 20), (1123,))
Original Size ((970, 22), (970,))
Removed ['country' 'status']
New Size ((526, 20), (526,))
