# Challenge 2 - Logistic Regression

# Before you start:

    Read the README.md file
    Comment as much as you can and use the resources (README.md file)
    Happy learning!

This exercise is very similar to Challenge 1, except that here we are working on a
classification problem using logistic regression.

In this lab exercise we will work with the Customer-Churn.csv data set.
You can find a copy of the dataset in the GitHub folder.
The objective of this exercise is to predict whether a customer will churn or not.
Please follow the steps and provide your code along with comments.

In [1]:
# Import the libraries for loading the data set 
# Load the dataset 

import pandas as pd
import numpy as np

data = pd.read_csv('Customer-Churn.csv')
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [2]:
# You can use the warnings library to ignore warnings that might show when you run the code

import warnings
warnings.filterwarnings('ignore')

### In this exercise, we are not going to use all the variables in the dataset to make the predictions. Here is the list of the numerical and categorical variables that we will use:
    
    Numerical : MonthlyCharges, TotalCharges, tenure
    Categorical : gender, SeniorCitizen, Partner, Dependents, Contract

Together, the numerical and categorical variables make up the predictor variables.
The target variable is "Churn".

### Data Pre-processing (Handling Numerical variables)

First we will perform the data pre-processing operations on the specified numerical variables.

In [3]:
# Store the specified numerical columns data as a separate dataframe. Give it the name "numerics"

numeric_cols = ['MonthlyCharges', 'TotalCharges', 'tenure']

# Take an explicit copy so that later column assignments do not write into a view of "data"
numerics = data[numeric_cols].copy()

### MinMax Scaler

Hint: Since we are using "numerics" to store the numerical variables, we can pass "numerics" directly,
as in MinMaxScaler().fit(numerics)
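
For reference, with its default `feature_range=(0, 1)`, MinMaxScaler rescales each column as (x - min) / (max - min). Below is a minimal sketch of that formula in plain pandas. The frame `toy` and its values are made up for illustration and are not the dataset itself.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Each column is scaled independently, so a column's minimum always becomes 0 and its maximum becomes 1, whatever the original units were.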

In [4]:
# Import the required library
# Perform the scaling and store the results inside "numerical"

from sklearn.preprocessing import MinMaxScaler


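# TotalCharges is read as text. Append ".0" to entries that have no decimal point
# so that every value can be cast to float (the raw string avoids an invalid-escape warning)
numerics['TotalCharges'] = np.where(numerics['TotalCharges'].str.contains(r'\.'), numerics['TotalCharges'], numerics['TotalCharges'] + ".0")
numerics['TotalCharges'] = numerics['TotalCharges'].astype('float')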


transformer = MinMaxScaler().fit(numerics[numeric_cols])
numerical = transformer.transform(numerics[numeric_cols])
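
A possible alternative to the string patching above is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be counted and handled explicitly. This is only a sketch on a made-up Series; whether the real column contains blank entries is an assumption, and filling with 0.0 is just one possible choice.

```python
import pandas as pd

# Hypothetical raw values, including a blank entry and an integer-looking string
raw = pd.Series(["29.85", "1889.5", " ", "20"])

# Unparseable strings become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

# Count the entries that could not be parsed
n_missing = int(clean.isna().sum())

# One possible (assumed) way to handle them: fill with 0.0
filled = clean.fillna(0.0)

print(n_missing)
print(filled.tolist())
```

Counting the NaN values first makes it visible how many rows are affected before deciding how to fill or drop them.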

In [5]:
# Convert "numerical" into a dataframe so that it can be used later with the dataframe of categorical variables

numerical = pd.DataFrame(numerical, columns=numeric_cols)
numerical.head()

| | MonthlyCharges | TotalCharges | tenure |
|---|---|---|---|
| 0 | 0.115423 | 0.003437 | 0.013889 |
| 1 | 0.385075 | 0.217564 | 0.472222 |
| 2 | 0.354229 | 0.012453 | 0.027778 |
| 3 | 0.239303 | 0.211951 | 0.625 |
| 4 | 0.521891 | 0.017462 | 0.027778 |


### Data Pre-processing (Handling Categorical variables)

In this step we will perform the data pre-processing operations on the specified categorical variables.

In [6]:
# Similar to numerical variables, store the specified categorical columns data as a dataframe. 
# Give it the name "cats"

cat_cols = ['gender','SeniorCitizen', 'Partner', 'Dependents', 'Contract']
cats = data[cat_cols]

In [7]:
# Check if "cats" is actually a dataframe using cats.head(3)

cats.head(3)

| | gender | SeniorCitizen | Partner | Dependents | Contract |
|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | Month-to-month |
| 1 | Male | 0 | No | No | One year |
| 2 | Male | 0 | No | No | Month-to-month |


### Using One Hot Encoding 

In [8]:
# Perform One hot encoding and store the results (one hot encoded dataframe) into "categorical"

categorical = pd.get_dummies(cats, columns=cat_cols)

In [9]:
# Check what the new one-hot-encoded (OHE) data looks like using the head() function

categorical.head()

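| | gender_Female | gender_Male | SeniorCitizen_0 | SeniorCitizen_1 | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |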
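
Full one-hot encoding keeps one column per category, so the columns of each variable always sum to 1 and are perfectly collinear. One common option, not used in this notebook, is `drop_first=True`, which drops the first level of each variable. The sketch below uses a made-up frame; `dtype=int` is passed because newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# drop_first=True removes the first (alphabetically sorted) level of each variable
encoded = pd.get_dummies(toy, columns=["gender", "Contract"], drop_first=True, dtype=int)

print(encoded)
```

The dropped level becomes the implicit baseline: a row with `gender_Male == 0` is a "Female" row.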


Now we have pre-processed our specified numerical and categorical data.
In the next step we will combine the two dataframes (numerical and categorical).
You can use the following code to combine / concatenate the two dataframes.

In [10]:
X = pd.concat([numerical,categorical],axis=1)
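
Note that `pd.concat` with `axis=1` aligns rows on the index, not on position. Here both frames presumably share the default 0..n-1 index, so the rows line up. The sketch below, on made-up frames, shows what happens when the indexes differ and how `reset_index(drop=True)` restores positional alignment.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({"tenure": [0.1, 0.5, 0.9]})                    # index 0, 1, 2
right = pd.DataFrame({"gender_Male": [1, 0, 1]}, index=[1, 2, 3])   # index 1, 2, 3

# Rows are matched by index label, so unmatched labels produce NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes the concat positional again
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A quick check such as comparing `X.shape[0]` with `len(data)` can catch this kind of silent misalignment.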

In [11]:
# Now that we have processed our predictor variables, we can work towards
# fitting the multiple regression model on the data.
# First encode the target: Churn == 'Yes' becomes 1, everything else becomes 0

Y = np.where(data['Churn']=='Yes', 1, 0)
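
Once the target is encoded as 0/1, its mean is the share of churners. This also gives a simple baseline: always predicting the majority class reaches an accuracy of max(rate, 1 - rate). A minimal sketch on made-up labels (the array `churn` is illustrative, not the real column):

```python
import numpy as np

# Hypothetical labels in the same "Yes"/"No" format as the Churn column
churn = np.array(["No", "No", "Yes", "No", "Yes"])

# Same encoding as above: "Yes" -> 1, anything else -> 0
y = np.where(churn == "Yes", 1, 0)

# Share of positive (churn) cases
churn_rate = y.mean()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(churn_rate, 1 - churn_rate)

print(churn_rate, baseline_accuracy)
```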

In [12]:
# Import the libraries required for regression model 
# Fit the linear regression model on the data

from sklearn import linear_model

lin_mod = linear_model.LinearRegression()
model = lin_mod.fit(X,Y)
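
The cell above fits a linear regression on the 0/1 target, while the challenge title refers to logistic regression. For comparison, here is a minimal sketch of fitting scikit-learn's `LogisticRegression`. It uses synthetic data (`X_toy`, `y_toy` are made up), so it says nothing about the results on the churn dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 3 features scaled to [0, 1]
rng = np.random.default_rng(0)
X_toy = rng.random((200, 3))

# Synthetic binary target driven mostly by the first feature, plus noise
y_toy = (X_toy[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

# Fit the classifier; max_iter is raised only to be safe about convergence
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

labels = clf.predict(X_toy)              # hard 0/1 class labels
probs = clf.predict_proba(X_toy)[:, 1]   # estimated probability of class 1
accuracy = clf.score(X_toy, y_toy)       # share of correct labels on the training data

print(accuracy)
```

Unlike the linear model, the classifier's probabilities always lie between 0 and 1, and `predict` returns class labels directly.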

In [13]:
# Make predictions on the dataset, store the results in "predictions"

predictions = lin_mod.predict(X)

In [14]:
# Print the measures of accuracy of the model - MSE, RMSE, and R2 score
# Hint: Use from sklearn.metrics import mean_squared_error, r2_score

from sklearn.metrics import mean_squared_error, r2_score

print(mean_squared_error(Y, predictions))
r2_score(Y, predictions)

0.1460225450835557


0.25096939228384685
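
The comment in the cell above also mentions RMSE, which is simply the square root of the MSE (for the MSE printed above, about 0.382). Because the target is 0/1, the continuous predictions can also be turned into class labels with a threshold. The sketch below uses made-up arrays, and the 0.5 cut-off is an assumption, not something the notebook prescribes.

```python
import numpy as np

# Hypothetical true labels and continuous predictions
y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0.1, 0.4, 0.7, 0.3, 0.2])

# Mean squared error and its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# Threshold the continuous predictions at 0.5 (assumed cut-off) to get labels
labels = (y_pred >= 0.5).astype(int)

# Share of labels that match the true classes
accuracy = np.mean(labels == y_true)

print(mse, rmse, accuracy)
```

Accuracy computed this way is usually easier to interpret for a churn / no-churn problem than MSE or R2.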