# Intermediate-Statistics (Demo)
1. One-Way ANOVA

2. F-Test

3. Correlation

4. Chi-Squared Test (Contingency; Confusion Table)

5. Linear Regression

* 5.1. Student Alcohol Consumption prediction
* 5.2. Forward, Backward, Stepwise

6. Logistic Regression (only numeric inputs)
* 6.1. Direct Target Marketing: Purchase (Yes/No)

7. Monte Carlo (Optimization)

![](https://media0.giphy.com/media/9ADoZQgs0tyww/giphy.gif)

In [83]:
!pip install -U mlmachine==0.1.5 &> /dev/null

In [84]:
# Import the required libraries
import numpy as np
import seaborn as sns
import pandas as pd
from matplotlib import pyplot
import matplotlib.pyplot as plt
import joblib
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

import statsmodels.api as sm
from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings('ignore')

print("Important libraries loaded successfully")

Important libraries loaded successfully


# 1. One-Way ANOVA

In [85]:
A = [12.6, 12, 11.8, 11.9, 13, 12.5, 14]
B = [10, 10.2, 10, 12, 14, 13]
C = [10.1, 13, 13.4, 12.9, 8.9, 10.7, 13.6, 12]

In [86]:
all_salaryK = A + B + C
company_names = (['A'] * len(A)) +  (['B'] * len(B)) +  (['C'] * len(C))

In [87]:
data = pd.DataFrame({'company': company_names, 'salaryK': all_salaryK})

In [88]:
data.head(20)

Unnamed: 0,company,salaryK
0,A,12.6
1,A,12.0
2,A,11.8
3,A,11.9
4,A,13.0
5,A,12.5
6,A,14.0
7,B,10.0
8,B,10.2
9,B,10.0


In [89]:
data.groupby('company').mean()

Unnamed: 0_level_0,salaryK
company,Unnamed: 1_level_1
A,12.542857
B,11.533333
C,11.825


In [90]:
lm = ols('salaryK ~ company',data=data).fit()
table = sm.stats.anova_lm(lm)
print(table)

            df     sum_sq   mean_sq         F    PR(>F)
company    2.0   3.606905  1.803452  0.821297  0.455683
Residual  18.0  39.525476  2.195860       NaN       NaN
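
As a cross-check, the same one-way ANOVA can be run directly on the three samples with `scipy.stats.f_oneway`. This is a minimal sketch that re-declares the salary lists from above; the names `f_stat` and `p_val` are illustrative and not part of the original notebook.

```python
from scipy.stats import f_oneway

# Salary samples (in thousands) for the three companies, copied from the cells above
A = [12.6, 12, 11.8, 11.9, 13, 12.5, 14]
B = [10, 10.2, 10, 12, 14, 13]
C = [10.1, 13, 13.4, 12.9, 8.9, 10.7, 13.6, 12]

# f_oneway returns the F statistic and its p-value
f_stat, p_val = f_oneway(A, B, C)
print(f"F = {f_stat:.6f}, p = {p_val:.6f}")
```

The two numbers should agree with the `F` and `PR(>F)` columns of the `anova_lm` table above.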


# 2. F-Test (Continued from Demo 1)

The F statistic is the ratio of two mean squares: the explained (between-group) variation divided by the unexplained (within-group) variation:

![](https://miro.medium.com/max/221/1*IOV_eIIQHXZq4taC0lLn2A.png)
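
In case the image does not render, the ratio can be written out as follows. The symbols $k$ (number of groups) and $N$ (total number of observations) are notation introduced here for illustration; MSM and MSE are the explained and residual mean squares, and SSM and SSE the corresponding sums of squares.

$$
F = \frac{\mathrm{MSM}}{\mathrm{MSE}} = \frac{\mathrm{SSM} / (k - 1)}{\mathrm{SSE} / (N - k)}
$$

For the salary data, $k = 3$ and $N = 21$, so $F = (3.6069 / 2) / (39.5255 / 18) \approx 0.8213$, which matches the ANOVA table from Demo 1.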


In [91]:
data.head(20)

Unnamed: 0,company,salaryK
0,A,12.6
1,A,12.0
2,A,11.8
3,A,11.9
4,A,13.0
5,A,12.5
6,A,14.0
7,B,10.0
8,B,10.2
9,B,10.0


In [92]:
# compute overall mean
overall_mean = data['salaryK'].mean()
overall_mean

11.980952380952381

In [93]:
# compute Sum of Squares Total (SST)
data['overall_mean'] = overall_mean
ss_total = sum((data['salaryK'] - data['overall_mean'])**2)
ss_total

43.132380952380956

In [94]:
# compute group means (the constant overall_mean column added above is averaged as well)
group_means = data.groupby('company').mean()
group_means = group_means.rename(columns = {'salaryK': 'group_mean'})
group_means

Unnamed: 0_level_0,group_mean,overall_mean
company,Unnamed: 1_level_1,Unnamed: 2_level_1
A,12.542857,11.980952
B,11.533333,11.980952
C,11.825,11.980952


In [95]:
# add group means and overall mean to the original data frame
data = data.merge(group_means, left_on = 'company', right_index = True)
data

Unnamed: 0,company,salaryK,overall_mean_x,group_mean,overall_mean_y
0,A,12.6,11.980952,12.542857,11.980952
1,A,12.0,11.980952,12.542857,11.980952
2,A,11.8,11.980952,12.542857,11.980952
3,A,11.9,11.980952,12.542857,11.980952
4,A,13.0,11.980952,12.542857,11.980952
5,A,12.5,11.980952,12.542857,11.980952
6,A,14.0,11.980952,12.542857,11.980952
7,B,10.0,11.980952,11.533333,11.980952
8,B,10.2,11.980952,11.533333,11.980952
9,B,10.0,11.980952,11.533333,11.980952


In [96]:
# compute Sum of Squares Residual (SSE)
ss_residual = sum((data['salaryK'] - data['group_mean'])**2)
ss_residual

39.52547619047619

In [97]:
# compute Sum of Squares Model (SSM)
overall_mean = data['salaryK'].mean()
ss_explained = sum((overall_mean - data['group_mean'])**2)
ss_explained

3.6069047619047776
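
The three sums of squares computed so far should satisfy SST = SSE + SSM. The sketch below checks that identity with plain NumPy on the same salary values; the variable names (`groups`, `sst`, `sse`, `ssm`) are chosen for illustration and are not reused from the notebook.

```python
import numpy as np

# Salary samples (in thousands), one array per company
groups = [
    np.array([12.6, 12, 11.8, 11.9, 13, 12.5, 14]),          # company A
    np.array([10, 10.2, 10, 12, 14, 13]),                    # company B
    np.array([10.1, 13, 13.4, 12.9, 8.9, 10.7, 13.6, 12]),   # company C
]
values = np.concatenate(groups)
grand_mean = values.mean()

# Total: squared distance of every observation from the overall mean
sst = ((values - grand_mean) ** 2).sum()
# Residual: squared distance of every observation from its own group mean
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Model: squared distance of each group mean from the overall mean, weighted by group size
ssm = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

print(sst, sse + ssm)
```

Both printed values should be equal up to floating-point rounding.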

In [98]:
# compute Mean Square Residual (MSE)
n_groups = len(set(data['company']))
n_obs = data.shape[0]
df_residual = n_obs - n_groups
ms_residual = ss_residual / df_residual
ms_residual

2.1958597883597886

In [99]:
# compute Mean Square Explained (MSM)
df_explained = n_groups - 1
ms_explained = ss_explained / df_explained
ms_explained

1.8034523809523888

In [100]:
# compute the F statistic (explained mean square over residual mean square)
f = ms_explained / ms_residual
f

0.8212966923081592

In [101]:
# compute p-value (Optional)
import scipy.stats
p_value = 1 - scipy.stats.f.cdf(f, df_explained, df_residual)
p_value

0.4556832940515221
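
The upper-tail probability can also be obtained from the survival function `scipy.stats.f.sf`, which is defined as `1 - cdf` and tends to be more accurate when the p-value is very small. A minimal sketch, with the F value and degrees of freedom copied from the cells above (the names `f_value`, `p_cdf` and `p_sf` are illustrative):

```python
from scipy.stats import f as f_dist

f_value = 0.8212966923081592        # F statistic computed above
df_explained, df_residual = 2, 18   # n_groups - 1 and n_obs - n_groups

# Two equivalent ways of getting P(F >= f_value)
p_cdf = 1 - f_dist.cdf(f_value, df_explained, df_residual)
p_sf = f_dist.sf(f_value, df_explained, df_residual)
print(p_cdf, p_sf)
```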

# 3. Correlation


![](http://moziru.com/images/hosue-clipart-sold-1.png)

Attribute Information:

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in $1000's




In [102]:
houseprice_df = pd.read_csv('https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/boston_house.csv', sep=';')

In [103]:
houseprice_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,B,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,396.9,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,396.9,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,392.83,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,394.63,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,396.9,36.2


In [104]:
houseprice_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   CRIM    506 non-null    float64
 1   ZN      506 non-null    float64
 2   INDUS   506 non-null    float64
 3   CHAS    506 non-null    int64  
 4   NOX     506 non-null    float64
 5   RM      506 non-null    float64
 6   AGE     506 non-null    float64
 7   DIS     506 non-null    float64
 8   RAD     506 non-null    int64  
 9   TAX     506 non-null    int64  
 10  B       506 non-null    float64
 11  PRICE   452 non-null    float64
dtypes: float64(9), int64(3)
memory usage: 47.6 KB


In [105]:
houseprice_df.dropna(inplace=True)
houseprice_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 452 entries, 0 to 505
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   CRIM    452 non-null    float64
 1   ZN      452 non-null    float64
 2   INDUS   452 non-null    float64
 3   CHAS    452 non-null    int64  
 4   NOX     452 non-null    float64
 5   RM      452 non-null    float64
 6   AGE     452 non-null    float64
 7   DIS     452 non-null    float64
 8   RAD     452 non-null    int64  
 9   TAX     452 non-null    int64  
 10  B       452 non-null    float64
 11  PRICE   452 non-null    float64
dtypes: float64(9), int64(3)
memory usage: 45.9 KB


In [106]:
# correlation
houseprice_df.corr(method='pearson')

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,B,PRICE
CRIM,1.0,-0.281239,0.573822,0.050066,0.636941,-0.142458,0.447664,-0.461967,0.897988,0.825668,-0.413014,-0.286245
ZN,-0.281239,1.0,-0.514273,-0.059697,-0.500729,0.306623,-0.555679,0.656152,-0.266606,-0.269446,0.15038,0.33157
INDUS,0.573822,-0.514273,1.0,0.103016,0.738517,-0.364895,0.605982,-0.669362,0.513306,0.673312,-0.316752,-0.411915
CHAS,0.050066,-0.059697,0.103016,1.0,0.134064,0.07658,0.123141,-0.140662,0.057337,0.016894,0.013415,0.154409
NOX,0.636941,-0.500729,0.738517,0.134064,1.0,-0.264594,0.707148,-0.745812,0.542499,0.61512,-0.358433,-0.332778
RM,-0.142458,0.306623,-0.364895,0.07658,-0.264594,1.0,-0.187871,0.138774,-0.095931,-0.214948,0.108352,0.740181
AGE,0.447664,-0.555679,0.605982,0.123141,0.707148,-0.187871,1.0,-0.720334,0.359326,0.427095,-0.223765,-0.299893
DIS,-0.461967,0.656152,-0.669362,-0.140662,-0.745812,0.138774,-0.720334,1.0,-0.388385,-0.444131,0.234449,0.138798
RAD,0.897988,-0.266606,0.513306,0.057337,0.542499,-0.095931,0.359326,-0.388385,1.0,0.872876,-0.352586,-0.217902
TAX,0.825668,-0.269446,0.673312,0.016894,0.61512,-0.214948,0.427095,-0.444131,0.872876,1.0,-0.367084,-0.345898


In [107]:
houseprice_df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'B', 'PRICE'],
      dtype='object')

In [108]:
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'B', 'PRICE']

In [109]:
import numpy as np
import scipy.stats

x = houseprice_df['RM']
y = houseprice_df['PRICE']
r,p = scipy.stats.pearsonr(x, y)
print(r,p)
r,p = scipy.stats.spearmanr(x, y)
print(r,p)
r,p = scipy.stats.kendalltau(x, y)
print(r,p)

0.740180804891272 1.3795164987697172e-79
0.6847898692170199 7.875596071257243e-64
0.523926690105912 7.613661296295075e-62
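
The three lines above are the Pearson, Spearman and Kendall coefficients (each followed by its p-value) for `RM` versus `PRICE`. To see how they differ, here is a small sketch on made-up data, not the housing data: a relationship that is perfectly monotonic but not linear. Pearson measures linear association, while Spearman and Kendall work on ranks.

```python
import numpy as np
from scipy import stats

# Toy data (an assumption for illustration): y grows monotonically but non-linearly with x
x = np.arange(1, 11)
y = x ** 3

r, _ = stats.pearsonr(x, y)      # linear association
rho, _ = stats.spearmanr(x, y)   # rank correlation
tau, _ = stats.kendalltau(x, y)  # concordant vs. discordant pairs

print(f"pearson={r:.3f}, spearman={rho:.3f}, kendall={tau:.3f}")
```

Because the ranks of `x` and `y` agree exactly, the two rank-based coefficients reach 1, while Pearson stays below 1.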


# 4. Chi-Squared Test (Contingency; Confusion Table)


In [110]:
!pip install researchpy



In [111]:
import researchpy as rp
from scipy import stats

df = pd.read_csv("https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/mental-heath-in-tech.csv")

In [112]:
df.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",Do you think that discussing a mental health disorder with your employer would have negative consequences?,Do you think that discussing a physical health issue with your employer would have negative consequences?,Would you feel comfortable discussing a mental health disorder with your coworkers?,Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?,Do you feel that your employer takes mental health as seriously as physical health?,Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?,Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?,Do you know local or online resources to seek help for a mental health disorder?,"If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?","If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?","If you have been diagnosed or treated for a mental 
health disorder, do you ever reveal this to coworkers or employees?","If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?",Do you believe your productivity is ever affected by a mental health issue?,"If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?",Do you have previous employers?,Have your previous employers provided mental health benefits?,Were you aware of the options for mental health care provided by your previous employers?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more about mental health issues and how to seek help?,Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Do you think that discussing a mental health disorder with previous employers would have negative consequences?,Do you think that discussing a physical health issue with previous employers would have negative consequences?,Would you have been willing to discuss a mental health issue with your previous co-workers?,Would you have been willing to discuss a mental health issue with your direct supervisor(s)?,Did you feel that your previous employers took mental health as seriously as physical health?,Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?,Would you be willing to bring up a physical health issue with a potential employer in an interview?,Why or why not?,Would you bring up a mental health issue with a potential employer in an interview?,Why or why not?.1,Do you feel that being identified as a person with a mental health issue would hurt your career?,Do you think that team members/co-workers would view you more negatively if they knew you suffered 
from a mental health issue?,How willing would you be to share with friends and family that you have a mental illness?,Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?,Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?,Do you have a family history of mental illness?,Have you had a mental health disorder in the past?,Do you currently have a mental health disorder?,"If yes, what condition(s) have you been diagnosed with?","If maybe, what condition(s) do you believe you have?",Have you been diagnosed with a mental health condition by a medical professional?,"If so, what condition(s) were you diagnosed with?",Have you ever sought treatment for a mental health issue from a mental health professional?,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,No,No,Maybe,Yes,I don't know,No,,,,,,,,,1,"No, none did",N/A (not currently aware),I don't know,None did,I don't know,Some of them,None of them,Some of my previous employers,Some of my previous employers,I don't know,None of them,Maybe,,Maybe,,Maybe,"No, I don't think they would",Somewhat open,No,,No,Yes,No,,,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",0,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,No,No,Maybe,Yes,Yes,No,,,,,,,,,1,"Yes, they all did",I was aware of some,None did,Some did,"Yes, always",None of them,None of them,"No, at none of my previous employers",Some of my previous employers,Some did,None of them,Maybe,It would depend on the health issue. If there ...,No,While mental health has become a more prominen...,"No, I don't think it would","No, I don't think they would",Somewhat open,No,,Yes,Yes,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",1,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,Maybe,No,Maybe,Maybe,I don't know,No,,,,,,,,,1,"No, none did",N/A (not currently aware),None did,Some did,I don't know,I don't know,Some of them,Some of my previous employers,I don't know,I don't know,Some of them,Yes,"They would provable need to know, to Judge if ...",Yes,"Stigma, mainly.",Maybe,Maybe,Somewhat open,Maybe/Not sure,Yes,No,Maybe,No,,,No,,1,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,,,,,,,1.0,"Yes, I know several","Sometimes, if it comes up",I'm not sure,"Sometimes, if it comes up",I'm not sure,Yes,1-25%,1,Some did,N/A (not currently aware),None did,None did,I don't know,Some of them,Some of them,Some of my previous employers,Some of my previous employers,I don't know,Some of them,Yes,"old back injury, doesn't cause me many issues ...",Maybe,would not if I was not 100% sure that the disc...,"Yes, I think it would",Maybe,Neutral,No,,No,Yes,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",1,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,Yes,Maybe,Maybe,No,No,No,,,,,,,,,1,I don't know,N/A (not currently aware),Some did,None did,I don't know,Some of them,Some of them,"No, at none of my previous employers",Some of my previous employers,Some did,Some of them,Maybe,Depending on the interview stage and whether I...,No,I don't know,"Yes, I think it would",Maybe,Somewhat open,"Yes, I experienced",Yes,Yes,Yes,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",,Yes,"Anxiety Disorder (Generalized, Social, Phobia,...",1,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes


In [113]:
rp.summary_cat(df[['Do you currently have a mental health disorder?', 'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?' ]])

Unnamed: 0,Variable,Outcome,Count,Percent
0,Do you currently have a mental health disorder?,Yes,575,40.13
1,,No,531,37.06
2,,Maybe,327,22.82
3,Would you have been willing to discuss a menta...,Some of my previous employers,654,51.74
4,,"No, at none of my previous employers",416,32.91
5,,I don't know,101,7.99
6,,"Yes, at all of my previous employers",93,7.36


In [114]:
def drop_maybe(answer):
    # keep only definite 'Yes'/'No' answers; anything else (e.g. 'Maybe') becomes None
    if answer.lower() in ('yes', 'no'):
        return answer
    return None

In [115]:
df['current_mental_disorder'] = df['Do you currently have a mental health disorder?'].apply(drop_maybe)
df['willing_discuss_mh_supervisor'] = df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?']
rp.summary_cat(df[['current_mental_disorder', 'willing_discuss_mh_supervisor']])

Unnamed: 0,Variable,Outcome,Count,Percent
0,current_mental_disorder,Yes,575,51.99
1,,No,531,48.01
2,willing_discuss_mh_supervisor,Some of my previous employers,654,51.74
3,,"No, at none of my previous employers",416,32.91
4,,I don't know,101,7.99
5,,"Yes, at all of my previous employers",93,7.36


In [116]:
pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

current_mental_disorder,No,Yes
willing_discuss_mh_supervisor,Unnamed: 1_level_1,Unnamed: 2_level_1
I don't know,51,29
"No, at none of my previous employers",119,194
Some of my previous employers,237,267
"Yes, at all of my previous employers",51,24


In [117]:
crosstab = pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])
crosstab

current_mental_disorder,No,Yes
willing_discuss_mh_supervisor,Unnamed: 1_level_1,Unnamed: 2_level_1
I don't know,51,29
"No, at none of my previous employers",119,194
Some of my previous employers,237,267
"Yes, at all of my previous employers",51,24


In [118]:
stats.chi2_contingency(crosstab)

(32.408194625396376,
 4.292859793048239e-07,
 3,
 array([[ 37.69547325,  42.30452675],
        [147.48353909, 165.51646091],
        [237.48148148, 266.51851852],
        [ 35.33950617,  39.66049383]]))

In [119]:
table, results = rp.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], prop= 'col', test= 'chi-square')
table

Unnamed: 0_level_0,current_mental_disorder,current_mental_disorder,current_mental_disorder
current_mental_disorder,No,Yes,All
willing_discuss_mh_supervisor,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
I don't know,11.14,5.64,8.23
"No, at none of my previous employers",25.98,37.74,32.2
Some of my previous employers,51.75,51.95,51.85
"Yes, at all of my previous employers",11.14,4.67,7.72
All,100.0,100.0,100.0


In [120]:
results

Unnamed: 0,Chi-square test,results
0,Pearson Chi-square ( 3.0) =,32.4082
1,p-value =,0.0
2,Cramer's V =,0.1826
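
The tuple returned by `stats.chi2_contingency` holds, in order, the chi-square statistic, the p-value, the degrees of freedom and the expected counts. The sketch below unpacks it and recomputes Cramér's V by hand using the usual formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The counts are copied from the crosstab above; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the crosstab above (rows: willingness, columns: No / Yes)
observed = np.array([
    [51, 29],     # I don't know
    [119, 194],   # No, at none of my previous employers
    [237, 267],   # Some of my previous employers
    [51, 24],     # Yes, at all of my previous employers
])

chi2, p, dof, expected = chi2_contingency(observed)

# Cramér's V: effect size for a contingency table, between 0 and 1
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"chi2={chi2:.4f}, p={p:.3g}, dof={dof}, V={cramers_v:.4f}")
```

The p-value that `researchpy` prints as `0.0` is a rounded value; the unrounded one is the `4.29e-07` shown in the `chi2_contingency` output.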


# 5. Linear Regression (Play with the Student Alcohol Consumption Dataset)


Dataset reference: https://www.kaggle.com/uciml/student-alcohol-consumption

 ![](https://scontent-yyz1-1.cdninstagram.com/v/t51.2885-15/sh0.08/e35/c180.0.1079.1079a/s640x640/70199188_954316201572050_3255215538695518813_n.jpg?_nc_ht=scontent-yyz1-1.cdninstagram.com&_nc_cat=102&_nc_ohc=L2qCNIf88XQAX_ULy7_&oh=dc15ac7019db6a0f10b050e09cb163c7&oe=5ECB96C9)

In [121]:
# Importing and combining datasets
df1 = pd.read_csv("https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/student-mat.csv")
df2 = pd.read_csv("https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/student-por.csv")
df3 = pd.concat([df1,df2])
df3.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


In [122]:
df3=df3.drop_duplicates(["school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"])
df3.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

In [123]:
df3.info()
# Drop the columns that are not essential for grade prediction
df3 = df3.drop(['famsize', 'Pstatus', 'Fjob', 'Mjob'],axis=1)
df3 = df3.drop(['reason','traveltime', 'studytime', 'failures'],axis=1)
df3 = df3.drop(['schoolsup','famsup', 'paid', 'nursery', 'internet', 'freetime'],axis=1)
df3 = df3.drop(['higher', 'health'],axis=1)
df3.columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 662 entries, 0 to 642
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      662 non-null    object
 1   sex         662 non-null    object
 2   age         662 non-null    int64 
 3   address     662 non-null    object
 4   famsize     662 non-null    object
 5   Pstatus     662 non-null    object
 6   Medu        662 non-null    int64 
 7   Fedu        662 non-null    int64 
 8   Mjob        662 non-null    object
 9   Fjob        662 non-null    object
 10  reason      662 non-null    object
 11  guardian    662 non-null    object
 12  traveltime  662 non-null    int64 
 13  studytime   662 non-null    int64 
 14  failures    662 non-null    int64 
 15  schoolsup   662 non-null    object
 16  famsup      662 non-null    object
 17  paid        662 non-null    object
 18  activities  662 non-null    object
 19  nursery     662 non-null    object
 20  higher    

Index(['school', 'sex', 'age', 'address', 'Medu', 'Fedu', 'guardian',
       'activities', 'romantic', 'famrel', 'goout', 'Dalc', 'Walc', 'absences',
       'G1', 'G2', 'G3'],
      dtype='object')

In [124]:
df3.describe()

Unnamed: 0,age,Medu,Fedu,famrel,goout,Dalc,Walc,absences,G1,G2,G3
count,662.0,662.0,662.0,662.0,662.0,662.0,662.0,662.0,662.0,662.0,662.0
mean,16.812689,2.492447,2.293051,3.938066,3.172205,1.504532,2.282477,4.930514,10.728097,10.708459,10.725076
std,1.269194,1.130958,1.094027,0.94193,1.161907,0.926567,1.290121,6.85806,3.082098,3.529588,4.106738
min,15.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,1.0,4.0,2.0,1.0,1.0,0.0,8.0,9.0,9.0
50%,17.0,2.0,2.0,4.0,3.0,1.0,2.0,3.0,10.0,11.0,11.0
75%,18.0,4.0,3.0,5.0,4.0,2.0,3.0,8.0,13.0,13.0,13.0
max,22.0,4.0,4.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


In [125]:
df3.corr()

Unnamed: 0,age,Medu,Fedu,famrel,goout,Dalc,Walc,absences,G1,G2,G3
age,1.0,-0.119031,-0.133645,-0.008453,0.094744,0.130655,0.06932,0.168313,-0.113206,-0.130408,-0.127446
Medu,-0.119031,1.0,0.635156,0.010211,0.022866,-0.018013,-0.036381,0.095703,0.208172,0.202776,0.182611
Fedu,-0.133645,0.635156,1.0,0.016171,0.053071,-0.008774,0.03237,0.068048,0.185636,0.156541,0.137159
famrel,-0.008453,0.010211,0.016171,1.0,0.091317,-0.069881,-0.092647,-0.067647,0.060893,0.052807,0.061296
goout,0.094744,0.022866,0.053071,0.091317,1.0,0.252217,0.404503,0.025046,-0.093786,-0.09103,-0.081691
Dalc,0.130655,-0.018013,-0.008774,-0.069881,0.252217,1.0,0.618431,0.125517,-0.112935,-0.084944,-0.075212
Walc,0.06932,-0.036381,0.03237,-0.092647,0.404503,0.618431,1.0,0.131147,-0.131321,-0.10282,-0.084689
absences,0.168313,0.095703,0.068048,-0.067647,0.025046,0.125517,0.131147,1.0,-0.050925,-0.04265,-0.004601
G1,-0.113206,0.208172,0.185636,0.060893,-0.093786,-0.112935,-0.131321,-0.050925,1.0,0.841436,0.796569
G2,-0.130408,0.202776,0.156541,0.052807,-0.09103,-0.084944,-0.10282,-0.04265,0.841436,1.0,0.913548


In [126]:
# One-hot encode the categorical columns (drop_first=True drops one level per variable)
df3 = pd.get_dummies(df3, drop_first=True)
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 662 entries, 0 to 642
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              662 non-null    int64
 1   Medu             662 non-null    int64
 2   Fedu             662 non-null    int64
 3   famrel           662 non-null    int64
 4   goout            662 non-null    int64
 5   Dalc             662 non-null    int64
 6   Walc             662 non-null    int64
 7   absences         662 non-null    int64
 8   G1               662 non-null    int64
 9   G2               662 non-null    int64
 10  G3               662 non-null    int64
 11  school_MS        662 non-null    uint8
 12  sex_M            662 non-null    uint8
 13  address_U        662 non-null    uint8
 14  guardian_mother  662 non-null    uint8
 15  guardian_other   662 non-null    uint8
 16  activities_yes   662 non-null    uint8
 17  romantic_yes     662 non-null    uint8
dtypes: int64(1

In [127]:
# define target variable and training and test sets
X = df3.drop("G3",axis=1)
Y = df3["G3"]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2020)

In [128]:
X.shape

(662, 17)

## 5.1. Student Alcohol Consumption prediction


In [129]:
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

## 5.2. Forward, Backward, Stepwise


### 5.2.1 Backward selection with statsmodels

In [130]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Building the optimal model using backward elimination
# Note: no intercept column is added (e.g. via sm.add_constant), which is why
# the summary below reports "uncentered" R-squared values
import statsmodels.api as sm
X_opt = X
regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
regressor_OLS.summary()

# Backward elimination process:
# drop the variables that are not significant (p > 0.05) and refit
X_opt = X.drop(['goout','activities_yes', 'address_U', 'school_MS', 'sex_M', 'guardian_mother'], axis=1)
regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,G3,R-squared (uncentered):,0.98
Model:,OLS,Adj. R-squared (uncentered):,0.979
Method:,Least Squares,F-statistic:,2835.0
Date:,"Tue, 01 Feb 2022",Prob (F-statistic):,0.0
Time:,13:20:43,Log-Likelihood:,-1267.6
No. Observations:,662,AIC:,2557.0
Df Residuals:,651,BIC:,2607.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
age,-0.0663,0.023,-2.903,0.004,-0.111,-0.021
Medu,-0.0139,0.075,-0.186,0.853,-0.161,0.133
Fedu,-0.0702,0.077,-0.914,0.361,-0.221,0.081
famrel,0.0477,0.067,0.711,0.478,-0.084,0.179
Dalc,-0.0033,0.090,-0.037,0.971,-0.180,0.173
Walc,0.0278,0.065,0.428,0.668,-0.100,0.155
absences,0.0266,0.010,2.721,0.007,0.007,0.046
G1,0.1365,0.039,3.501,0.000,0.060,0.213
G2,0.9602,0.034,28.195,0.000,0.893,1.027

0,1,2,3
Omnibus:,423.93,Durbin-Watson:,1.798
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4336.206
Skew:,-2.782,Prob(JB):,0.0
Kurtosis:,14.236,Cond. No.,92.3


### 5.2.2 Forward selection with mlxtend

In [131]:
X.shape

(662, 17)

In [132]:
# Build step forward feature selection
sfs1 = sfs(regressor,
           k_features=5,
           forward=True,
           floating=False,
           verbose=2,
           scoring='neg_mean_squared_error',
           cv=10)

# Perform SFS
sfs1 = sfs1.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.3s finished

[2022-02-01 13:20:44] Features: 1/5 -- score: -2.8968026915784693[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    0.3s finished

[2022-02-01 13:20:44] Features: 2/5 -- score: -2.850989029263633[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.2s finished

[2022-02-01 13:20:44] Features: 3/5 -- score: -2.8304810309710846[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   

In [133]:
# Which features?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)

[7, 8, 9, 10, 16]


In [134]:
X_train.iloc[:, feat_cols].head()

Unnamed: 0,absences,G1,G2,school_MS,romantic_yes
369,12,10,10,0,1
223,0,12,13,0,0
179,10,8,7,0,0
106,8,7,8,0,0
558,0,8,13,1,1


In [135]:
# Build full model with selected features
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

regressor = LinearRegression()
regressor.fit(X_train.iloc[:, feat_cols], y_train)
y_test_pred = regressor.predict(X_test.iloc[:, feat_cols])
print('Testing RMSE on selected features: %.3f' % rmse(y_test, y_test_pred))

Testing RMSE on selected features: 1.581


In [136]:
# Build full model on ALL features, for comparison
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_test_pred = regressor.predict(X_test)
print('Testing RMSE on all features: %.3f' % rmse(y_test, y_test_pred))

Testing RMSE on all features: 1.620


#6.Logistic Regression (only numeric inputs)


In [137]:
ds = pd.read_csv('https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/advertising.csv')

In [138]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB


In [139]:
ds.head(5)

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


In [140]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [141]:
y = ds['Clicked on Ad']
X = ds[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male']]

In [142]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [143]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
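
The scaler is fitted on the training split only, and the same training mean and standard deviation are then applied to the test split. A rough sketch of that idea in plain Python is shown below; the ages are hypothetical, and the population standard deviation is used on the assumption that this mirrors what `StandardScaler` computes.

```python
import statistics

# Hypothetical ages, already split into train and test
train_age = [20, 30, 40, 50]
test_age = [35, 60]

# "fit": learn the mean and population standard deviation from the training data only
mu = statistics.mean(train_age)
sigma = statistics.pstdev(train_age)

# "transform": apply the training statistics to both splits
train_scaled = [(x - mu) / sigma for x in train_age]
test_scaled = [(x - mu) / sigma for x in test_age]

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, because it reuses the training statistics.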

###6.1.Direct Target Marketing : Purchase(Yes/No)


In [144]:
model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [145]:
print(classification_report(y_test, y_pred, digits=4))
print('\n')
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0     0.9712    0.9926    0.9818       136
           1     0.9910    0.9649    0.9778       114

    accuracy                         0.9800       250
   macro avg     0.9811    0.9788    0.9798       250
weighted avg     0.9802    0.9800    0.9800       250



[[135   1]
 [  4 110]]
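
The numbers in the classification report can be recomputed by hand from the confusion matrix above. The sketch below assumes the scikit-learn layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[135, 1],
      [4, 110]]
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of the predicted clicks, how many were real clicks
recall_1 = tp / (tp + fn)      # of the real clicks, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.4f} "
      f"recall={recall_1:.4f} f1={f1_1:.4f}")
```

These values match the row for class 1 and the accuracy line in the report.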


In [146]:
model.coef_

array([[-2.48433792,  1.03334826, -1.47633287, -2.37579962, -0.12886094]])

In [147]:
model.intercept_

array([1.48849751])
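
To read these numbers, recall that logistic regression passes a weighted sum of the (standardized) inputs through the sigmoid function. The sketch below plugs in the coefficients and intercept printed above; the helper name `predict_proba_clicked` and the example users are illustrative assumptions, not part of the notebook.

```python
import math

# Coefficients and intercept printed above, in the column order of X
features = ['Daily Time Spent on Site', 'Age', 'Area Income',
            'Daily Internet Usage', 'Male']
coef = [-2.48433792, 1.03334826, -1.47633287, -2.37579962, -0.12886094]
intercept = 1.48849751

def predict_proba_clicked(x_scaled):
    """P(Clicked on Ad = 1) for one already-standardized row (hypothetical helper)."""
    z = intercept + sum(w * x for w, x in zip(coef, x_scaled))
    return 1.0 / (1.0 + math.exp(-z))

# A user exactly at the training mean on every feature (all zeros after scaling)
p_average = predict_proba_clicked([0, 0, 0, 0, 0])

# Hypothetical user: one standard deviation more time on site, otherwise average
p_more_time = predict_proba_clicked([1, 0, 0, 0, 0])

# Odds ratio for a one-standard-deviation increase in each feature
odds_ratios = {name: math.exp(w) for name, w in zip(features, coef)}

print(round(p_average, 3), round(p_more_time, 3))
print({name: round(r, 3) for name, r in odds_ratios.items()})
```

Negative coefficients (time on site, income, internet usage) lower the odds of a click, while the positive coefficient on `Age` raises them. Since the inputs were standardized, each coefficient refers to a change of one standard deviation, not one raw unit.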

#7.McNemar's Test
credit: http://rasbt.github.io/mlxtend/user_guide/evaluate/mcnemar/

![](https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/img/w2_ex1.png)

In [148]:
import numpy as np
from mlxtend.evaluate import mcnemar_table

# The correct target (class) labels
y_target = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Class labels predicted by model 1
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0])

# Class labels predicted by model 2
y_model2 = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 0])

tb = mcnemar_table(y_target=y_target, 
                   y_model1=y_model1, 
                   y_model2=y_model2)

print(tb)

[[4 2]
 [1 3]]
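
The table above counts, for each sample, whether each model was right or wrong. A plain-Python sketch of that counting follows, assuming the layout where rows refer to model 1 and columns to model 2, with index 0 meaning "correct" and index 1 meaning "wrong".

```python
y_target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_model1 = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_model2 = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0]

# table[i][j]: i = 0 if model 1 is correct else 1, j = 0 if model 2 is correct else 1
table = [[0, 0], [0, 0]]
for t, m1, m2 in zip(y_target, y_model1, y_model2):
    row = 0 if m1 == t else 1
    col = 0 if m2 == t else 1
    table[row][col] += 1

print(table)
```

Only the off-diagonal cells, where exactly one model is correct, carry information for McNemar's test.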


![](https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/img/w2_ex5.png)

![](https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/img/w2_ex2.png)

In [149]:
import numpy as np

tb_b = np.array([[9945, 25],
                 [15, 15]])

In [150]:
from mlxtend.evaluate import mcnemar

chi2, p = mcnemar(ary=tb_b)
print('chi-squared:', chi2)
print('p-value:', p)

chi-squared: 2.025
p-value: 0.15472892348537437
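
The statistic above can be reproduced by hand. With b = 25 and c = 15 as the two off-diagonal cells of `tb_b`, the continuity-corrected McNemar statistic is (|b - c| - 1)^2 / (b + c). The sketch below also computes the p-value for one degree of freedom using the identity P(X > x) = erfc(sqrt(x / 2)), which avoids a SciPy dependency.

```python
import math

# Off-diagonal cells of tb_b: cases where exactly one model is correct
b, c = 25, 15

# McNemar's statistic with continuity correction
chi2 = (abs(b - c) - 1) ** 2 / (b + c)

# Survival function of the chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

print('chi-squared:', chi2)
print('p-value:', p_value)
```

The large diagonal counts (9945 and 15) do not enter the calculation at all.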


![](https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/img/w2_ex3.png)

![](https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/img/w2_ex4.png)

In [151]:
from mlxtend.evaluate import mcnemar
import numpy as np

tb_a = np.array([[9959, 11],
                 [1, 29]])

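# exact=True switches to the exact binomial test, suited to small b + c (here 11 + 1 = 12)
chi2, p = mcnemar(ary=tb_a, exact=True)

# Note: with exact=True the first value printed below is not a chi-squared statistic
# (here it equals the smaller off-diagonal count), so interpret only the p-value
print('chi-squared:', chi2)
print('p-value:', p)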

chi-squared: 1
p-value: 0.00634765625
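
The exact p-value above comes from a two-sided binomial test on the off-diagonal cells: under the null hypothesis, each of the n = b + c disagreements is equally likely to favor either model. A minimal sketch of that calculation:

```python
from math import comb

# Off-diagonal cells of tb_a
b, c = 11, 1
n = b + c
k = min(b, c)

# Two-sided exact binomial p-value: 2 * P(X <= k) with X ~ Binomial(n, 0.5), capped at 1
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n)

print('p-value:', p_value)
```

Here the sum is C(12, 0) + C(12, 1) = 13, so the p-value is 2 * 13 / 4096.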


#8.MonteCarlo (Optimization)

credit: https://pbpython.com/monte-carlo.html

![](https://pbpython.com/images/monte_carlo_cover.jpg)

In [152]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_style('whitegrid')

In [153]:
avg = 1
std_dev = .1
num_reps = 500
num_simulations = 1000

In [154]:
pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)

In [155]:
sales_target_values = [75_000, 100_000, 200_000, 300_000, 400_000, 500_000]
sales_target_prob = [.3, .3, .2, .1, .05, .05]
sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)

In [156]:
df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
                                               'Sales_Target': sales_target})

df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']

In [157]:
df.head()

Unnamed: 0,Pct_To_Target,Sales_Target,Sales
0,0.95,75000,71250.0
1,1.01,300000,303000.0
2,0.89,75000,66750.0
3,1.03,100000,103000.0
4,0.94,100000,94000.0


In [158]:
df['Sales_Target'].value_counts().sort_index()

75000     149
100000    157
200000    100
300000     40
400000     22
500000     32
Name: Sales_Target, dtype: int64

In [159]:
# sales_target_prob = [.3, .3, .2, .1, .05, .05]
df['Sales_Target'].value_counts(normalize=True).sort_index()

75000     0.298
100000    0.314
200000    0.200
300000    0.080
400000    0.044
500000    0.064
Name: Sales_Target, dtype: float64

In [160]:
def calc_commission_rate(x):
    """ Return the commission rate based on the table:
    0-90% = 2%
    91-99% = 3%
    >= 100% = 4%
    """
    if x <= .90:
        return .02
    if x <= .99:
        return .03
    else:
        return .04
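
The tier boundaries can also be expressed as a lookup table, which is easier to extend if more tiers are added. The sketch below is an alternative to the chained `if` statements, not the notebook's method; the name `commission_rate_lookup` is hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first two tiers, and the rate for each of the three tiers
upper_bounds = [0.90, 0.99]
rates = [0.02, 0.03, 0.04]

def commission_rate_lookup(pct_to_target):
    """Return the commission rate for a percent-to-target value (hypothetical helper)."""
    # bisect_left keeps a value equal to a bound inside the lower tier
    return rates[bisect_left(upper_bounds, pct_to_target)]

for pct in (0.85, 0.90, 0.91, 0.99, 1.00, 1.20):
    print(pct, commission_rate_lookup(pct))
```

Because the simulated percentages are rounded to two decimals, a value can never fall strictly between 0.90 and 0.91, so the tiers in the docstring and the comparisons in the code agree.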

In [161]:
df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']
# Define a list to keep all the results from each simulation that we want to analyze
all_stats = []

# Loop through many simulations
for i in range(num_simulations):

    # Choose random inputs for the sales targets and percent to target
    sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)
    pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)

    # Build the dataframe based on the inputs and number of reps
    df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
                                                   'Sales_Target': sales_target})

    # Back into the sales number using the percent to target rate
    df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']

    # Determine the commissions rate and calculate it
    df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
    df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']

    # We want to track sales, commission amounts and sales targets over all the simulations
    all_stats.append([df['Sales'].sum().round(0),
                      df['Commission_Amount'].sum().round(0),
                      df['Sales_Target'].sum().round(0)])

In [162]:
results_df = pd.DataFrame.from_records(all_stats, columns=['Sales',
                                                           'Commission_Amount',
                                                           'Sales_Target'])

In [163]:
results_df.describe().style.format('{:,}')

Unnamed: 0,Sales,Commission_Amount,Sales_Target
count,1000.0,1000.0,1000.0
mean,83769385.0,2860169.972,83767175.0
std,2700648.74338744,99563.63876786496,2683851.7012549946
min,74660250.0,2530490.0,74800000.0
25%,81843000.0,2789610.0,81850000.0
50%,83769125.0,2859493.5,83812500.0
75%,85584250.0,2926779.75,85475000.0
max,92894250.0,3169250.0,92775000.0
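
As a sanity check on the simulated `Sales_Target` column, the theoretical mean and standard deviation of the total can be worked out directly from the target values and their probabilities. The sketch below assumes the 500 reps are drawn independently, as in the simulation loop.

```python
import math

sales_target_values = [75_000, 100_000, 200_000, 300_000, 400_000, 500_000]
sales_target_prob = [.3, .3, .2, .1, .05, .05]
num_reps = 500

# Expected target and variance for a single rep
mean_per_rep = sum(v * p for v, p in zip(sales_target_values, sales_target_prob))
var_per_rep = sum(p * (v - mean_per_rep) ** 2
                  for v, p in zip(sales_target_values, sales_target_prob))

# For a sum of independent reps, means and variances both add
expected_total = num_reps * mean_per_rep
std_total = math.sqrt(num_reps * var_per_rep)

print(f"expected total target: {expected_total:,.0f}")
print(f"std of total target:   {std_total:,.0f}")
```

The expected total of 83,750,000 is close to the simulated mean of 83,767,175, and the theoretical standard deviation of roughly 2.66 million is in line with the simulated 2.68 million.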
