In [129]:
import pandas as pd
import numpy as np
np.random.seed(10)

# Reduce the number of rows

Our data set is too large for the scope of this project, so we need to reduce the number of rows in both the training and the test data sets. We do not use the same method for both: the training set should end up balanced, whereas the test set should maintain the original proportion of the feature to be predicted.

## Reducing the test data set: Random sampling

To reduce the test data set, we select random indices of the original test data frame and drop the corresponding rows.
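
As a hedged alternative to dropping randomly chosen rows, the same kind of reduction can be sketched by keeping randomly chosen rows with `DataFrame.sample`. The frame `toy` and the size `n_keep` below are hypothetical stand-ins, not the project's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the test data frame (hypothetical data, not the real file).
rng = np.random.default_rng(10)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=1000),
    "y": rng.choice(["no", "yes"], size=1000, p=[0.89, 0.11]),
})

# Keep n_keep randomly chosen rows instead of dropping the complement.
n_keep = 333
toy_reduced = toy.sample(n=n_keep, random_state=10)

# sample() returns the rows in random order; sort_index() restores the original order.
toy_reduced = toy_reduced.sort_index()

print(len(toy_reduced))
print(toy_reduced["y"].value_counts(normalize=True))
```

Both approaches draw rows without replacement, so every row of the original frame appears at most once in the reduced one.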

In [130]:
# Reduce the test set to 3333 rows: we want 10000 rows in total, 2/3 for training and 1/3 for testing
df = pd.read_csv(r'../Datasets/testing-dataset.csv')
df = df[df.filter(regex='^(?!Unnamed)').columns]  # drop the unwanted 'Unnamed' column
nRowsToRemove = len(df) - 3333
rowsToDropIndices = np.random.choice(df.index, nRowsToRemove, replace=False)  # rows to discard, drawn without replacement
df_reduced = df.drop(rowsToDropIndices)
df_reduced = df_reduced[df_reduced.filter(regex='^(?!Unnamed)').columns]  # no effect here: df was already filtered above
df_reduced.to_csv(r'../Datasets/testing-dataset-reduced.csv')  # the index is written as an extra column unless index=False is passed

# Study if the result is representative
After sampling the original test data set, we need to check that the result is representative, i.e. that each column has a statistical structure similar to that of the original data set.

## Description of all the features 
For the numerical variables, these descriptions already show whether the mean, std, min, max and other statistics have changed. If the values are very similar, we can accept the previous row-reduction procedure and proceed with the project.
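
Rather than comparing the two `describe()` tables by eye, the relative change of the mean and std can be computed directly. This is a minimal sketch on synthetic data; the helper `numeric_summary_shift` and the frame `toy` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def numeric_summary_shift(original, reduced):
    """Relative change of mean and std for every numeric column."""
    stats = ["mean", "std"]
    orig = original.describe().loc[stats]   # describe() keeps numeric columns by default
    red = reduced.describe().loc[stats]
    # One row per column, one column per statistic.
    return ((red - orig) / orig.abs()).T

# Synthetic stand-in data (hypothetical, not the project's data set).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(40, 10, 5000),
    "campaign": rng.poisson(2.5, 5000).astype(float),
    "job": rng.choice(["a", "b"], 5000),
})
toy_reduced = toy.sample(n=1000, random_state=0)

shift = numeric_summary_shift(toy, toy_reduced)
print(shift)
```

Values close to zero indicate that the reduced frame preserves the statistic. A column whose original mean is zero would produce an infinite ratio and needs separate handling.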

In [131]:
# Summary statistics of each feature in the original test data set (the reduced one follows)
df.describe(include='all')

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
count,13593.0,13593,13593,13593,13593,13593,13593,13593,13593,13593,...,13593.0,13593.0,13593.0,13593,13593.0,13593.0,13593.0,13593.0,13593.0,13593
unique,,12,4,8,2,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,3417,8212,4031,10731,7083,11271,8598,4529,2837,...,,,,11768,,,,,,12049
mean,40.013316,,,,,,,,,,...,2.569337,961.959317,0.17097,,0.091098,93.579146,-40.475576,3.630456,5167.315795,
std,10.419831,,,,,,,,,,...,2.705298,188.19051,0.497837,,1.567287,0.578728,4.622785,1.732322,72.204364,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


In [132]:
df_reduced.describe(include='all')

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
count,3333.0,3333,3333,3333,3333,3333,3333,3333,3333,3333,...,3333.0,3333.0,3333.0,3333,3333.0,3333.0,3333.0,3333.0,3333.0,3333
unique,,12,4,7,2,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,wed,...,,,,nonexistent,,,,,,no
freq,,892,2011,984,2634,1760,2771,2076,1151,702,...,,,,2906,,,,,,2981
mean,39.833783,,,,,,,,,,...,2.561956,964.439544,0.159616,,0.105281,93.581854,-40.450255,3.649235,5167.968497,
std,10.216708,,,,,,,,,,...,2.736664,182.030587,0.472244,,1.553709,0.577217,4.615595,1.720646,71.070812,
min,18.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.635,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.354,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.876,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


The previous descriptions are not enough to check that the categorical values are represented equally in the original and the reduced data sets. We must also verify that the proportion of each value of each feature has not changed much.

In [133]:
print("Job feature percentages in the original data set")
print(df.job.value_counts(normalize=True))
print("Job feature percentages in the reduced data set")
print(df_reduced.job.value_counts(normalize=True))

Job feature percentages in the original data set
admin.           0.251379
blue-collar      0.223277
technician       0.162804
services         0.098433
management       0.069080
retired          0.044214
entrepreneur     0.034871
self-employed    0.034356
housemaid        0.026852
unemployed       0.024130
student          0.023689
unknown          0.006915
Name: job, dtype: float64
Job feature percentages in the reduced data set
admin.           0.267627
blue-collar      0.215122
technician       0.149415
services         0.106811
management       0.062706
retired          0.041404
entrepreneur     0.035404
self-employed    0.034203
housemaid        0.030003
unemployed       0.028503
student          0.021602
unknown          0.007201
Name: job, dtype: float64


In [134]:
print("Difference between the percentage of each 'job' value in the original test data set and the reduced data set")
print(df.job.value_counts(normalize=True)-df_reduced.job.value_counts(normalize=True))

Difference between the percentage of each 'job' value in the original test data set and the reduced data set
admin.          -0.016247
blue-collar      0.008155
technician       0.013389
services        -0.008378
management       0.006373
retired          0.002810
entrepreneur    -0.000533
self-employed    0.000152
housemaid       -0.003151
unemployed      -0.004373
student          0.002087
unknown         -0.000285
Name: job, dtype: float64


In [135]:
print("Difference between the percentage of each 'marital' value in the original test data set and the reduced data set")
print(df.marital.value_counts(normalize=True)-df_reduced.marital.value_counts(normalize=True))

Difference between the percentage of each 'marital' value in the original test data set and the reduced data set
married     0.000774
single      0.005966
divorced   -0.006553
unknown    -0.000187
Name: marital, dtype: float64


In [136]:
print("Difference between the percentage of each 'education' value in the original test data set and the reduced data set")
print(df.education.value_counts(normalize=True)-df_reduced.education.value_counts(normalize=True))

Difference between the percentage of each 'education' value in the original test data set and the reduced data set
basic.4y              -0.002129
basic.6y               0.003169
basic.9y               0.006357
high.school           -0.013830
illiterate                  NaN
professional.course    0.008643
university.degree      0.001320
unknown               -0.003824
Name: education, dtype: float64


Here we can see that the illiterate value has been lost in the reduced test data set: it appears in the original but not in the reduced one, so the subtraction yields NaN. Looking at the original proportion of this value, we see that it was very low (about 0.03%), so it is neither surprising nor alarming that it is not represented in the reduced version.
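
The NaN comes from index alignment: when two Series are subtracted, a label present in only one of them has no counterpart and the result is NaN. A small sketch with made-up proportions (not the project's values) shows the effect and how `Series.sub` with `fill_value=0` treats the missing category as a proportion of zero.

```python
import pandas as pd

# Made-up proportions for illustration only.
p_original = pd.Series({"university.degree": 0.7, "high.school": 0.299, "illiterate": 0.001})
p_reduced = pd.Series({"university.degree": 0.69, "high.school": 0.31})

# Plain subtraction: 'illiterate' is missing on the right-hand side, so the result is NaN.
plain = p_original - p_reduced

# fill_value=0 substitutes 0 for the missing label before subtracting.
filled = p_original.sub(p_reduced, fill_value=0)

print(plain)
print(filled)
```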

In [137]:
print(df.education.value_counts(normalize=True))

university.degree      0.296550
high.school            0.235195
basic.9y               0.150372
professional.course    0.121754
basic.4y               0.101081
basic.6y               0.052674
unknown                0.042080
illiterate             0.000294
Name: education, dtype: float64


In [138]:
print("Difference between the percentage of each 'default' value in the original test data set and the reduced data set")
print(df.default.value_counts(normalize=True)-df_reduced.default.value_counts(normalize=True))

Difference between the percentage of each 'default' value in the original test data set and the reduced data set
no        -0.000829
unknown    0.000829
Name: default, dtype: float64


In [139]:
print("Difference between the percentage of each 'housing' value in the original test data set and the reduced data set")
print(df.housing.value_counts(normalize=True)-df_reduced.housing.value_counts(normalize=True))

Difference between the percentage of each 'housing' value in the original test data set and the reduced data set
yes       -0.006976
no         0.004470
unknown    0.002506
Name: housing, dtype: float64


In [140]:
print("Difference between the percentage of each 'loan' value in the original test data set and the reduced data set")
print(df.loan.value_counts(normalize=True)-df_reduced.loan.value_counts(normalize=True))

Difference between the percentage of each 'loan' value in the original test data set and the reduced data set
no        -0.002206
yes       -0.000300
unknown    0.002506
Name: loan, dtype: float64


In [141]:
print("Difference between the percentage of each 'contact' value in the original test data set and the reduced data set")
print(df.contact.value_counts(normalize=True)-df_reduced.contact.value_counts(normalize=True))

Difference between the percentage of each 'contact' value in the original test data set and the reduced data set
cellular     0.009669
telephone   -0.009669
Name: contact, dtype: float64


In [142]:
print("Difference between the percentage of each 'month' value in the original test data set and the reduced data set")
print(df.month.value_counts(normalize=True)-df_reduced.month.value_counts(normalize=True))

Difference between the percentage of each 'month' value in the original test data set and the reduced data set
may   -0.012148
jul    0.003525
aug    0.005241
jun    0.005928
nov   -0.002892
apr   -0.002773
oct    0.000458
sep   -0.000321
mar    0.001558
dec    0.001425
Name: month, dtype: float64


In [143]:
print("Difference between the percentage of each 'day_of_week' value in the original test data set and the reduced data set")
print(df.day_of_week.value_counts(normalize=True)-df_reduced.day_of_week.value_counts(normalize=True))

Difference between the percentage of each 'day_of_week' value in the original test data set and the reduced data set
fri   -0.001156
mon    0.000139
thu    0.012491
tue   -0.001103
wed   -0.010371
Name: day_of_week, dtype: float64


In [144]:
print("Difference between the percentage of each 'poutcome' value in the original test data set and the reduced data set")
print(df.poutcome.value_counts(normalize=True)-df_reduced.poutcome.value_counts(normalize=True))

Difference between the percentage of each 'poutcome' value in the original test data set and the reduced data set
nonexistent   -0.006147
failure        0.004030
success        0.002117
Name: poutcome, dtype: float64


In [145]:
print("Difference between the percentage of each 'y' value in the original test data set and the reduced data set")
print(df.y.value_counts(normalize=True)-df_reduced.y.value_counts(normalize=True))

Difference between the percentage of each 'y' value in the original test data set and the reduced data set
no    -0.007977
yes    0.007977
Name: y, dtype: float64


All differences are below two percentage points, so we conclude that the sample obtained is representative and we can use it for our project.
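
The per-feature checks above repeat the same computation, so they could be condensed into one loop. The following is a sketch under stated assumptions: `max_proportion_shift` is a hypothetical helper, and the two tiny frames are toy data chosen so the result can be verified by hand.

```python
import pandas as pd

def max_proportion_shift(original, reduced, columns):
    """Largest absolute change in value proportions for each listed column."""
    result = {}
    for col in columns:
        p_orig = original[col].value_counts(normalize=True)
        p_red = reduced[col].value_counts(normalize=True)
        # A category missing from one frame counts as proportion 0.
        diff = p_orig.sub(p_red, fill_value=0)
        result[col] = diff.abs().max()
    return pd.Series(result).sort_values(ascending=False)

# Toy data: 'job' is a 0.6 / b 0.3 / c 0.1 and 'y' is no 0.8 / yes 0.2.
original = pd.DataFrame({
    "job": ["a"] * 6 + ["b"] * 3 + ["c"],
    "y": ["no"] * 8 + ["yes"] * 2,
})
# Hand-picked subset: 'job' becomes a 0.6 / b 0.4, 'y' stays no 0.8 / yes 0.2.
reduced = original.iloc[[0, 1, 2, 6, 8]]

shifts = max_proportion_shift(original, reduced, ["job", "y"])
print(shifts)
```

In the notebook this would be called with `df`, `df_reduced` and the list of categorical column names; a single threshold on the returned Series then replaces the visual inspection of each cell.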

# Reduce the number of rows to balance the classes
The previous reduction does not take into account whether the data set is balanced. In the training data set the feature to predict must be balanced, and ours is not, so we reduce the training data set by removing only rows of the majority class. This way we reduce the training data set and balance it at the same time.
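
The idea of removing only majority-class rows can be sketched on a toy frame. The names `toy` and `target_rows` are hypothetical, and the class counts are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)

# Toy frame with an imbalanced target: 890 'no' rows and 110 'yes' rows.
toy = pd.DataFrame({"y": ["no"] * 890 + ["yes"] * 110})
target_rows = 250

n_remove = len(toy) - target_rows
majority_index = toy.index[toy["y"] == "no"]
if n_remove > len(majority_index):
    raise ValueError("target size cannot be reached by removing majority rows only")

# Draw the rows to discard from the majority class only, without replacement.
drop_index = rng.choice(majority_index, size=n_remove, replace=False)
toy_balanced = toy.drop(drop_index)

print(toy_balanced["y"].value_counts())
```

All minority rows survive, so the final balance depends on the target size: the classes are exactly balanced only when the target equals twice the number of minority rows. Otherwise the result is approximately balanced, as in this toy case (140 `no` against 110 `yes`).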

In [146]:
df.y.value_counts(normalize=True)

no     0.886412
yes    0.113588
Name: y, dtype: float64

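In the complete original data set, rows with y=no make up 88.7% (36548 rows) of the total and rows with y=yes only 11.3% (4640 rows). The cell above shows nearly the same imbalance (88.6% against 11.4%) for the test data frame, which is what df still holds at this point.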

In [147]:
# Reduce the training set to 6667 rows (2/3 of 10000), removing only rows with y == 'no'
df = pd.read_csv(r'../Datasets/training-dataset.csv')
df = df[df.filter(regex='^(?!Unnamed)').columns]  # drop the unwanted 'Unnamed' column
nRowsToRemove = len(df) - 6667
rowsToDropIndices = np.random.choice(df[df.y == 'no'].index, nRowsToRemove, replace=False)  # draw only majority-class rows
df_balanced = df.drop(rowsToDropIndices)
df_balanced = df_balanced[df_balanced.filter(regex='^(?!Unnamed)').columns]  # drop the unwanted 'Unnamed' column
df_balanced.to_csv(r'../Datasets/training-dataset-reduced.csv')

In [148]:
df_balanced.y.value_counts(normalize=True)

no     0.535623
yes    0.464377
Name: y, dtype: float64

In [149]:
# Summary statistics of each feature in the original training data set
df.describe(include='all')

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
count,27595.0,27595,27595,27595,27595,27595,27595,27595,27595,27595,...,27595.0,27595.0,27595.0,27595,27595.0,27595.0,27595.0,27595.0,27595.0,27595
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,7005,16716,8137,21857,14493,22679,17546,9240,5786,...,,,,23795,,,,,,24499
mean,40.029353,,,,,,,,,,...,2.566733,962.729697,0.173945,,0.077347,93.573949,-40.515912,3.616776,5166.898043,
std,10.422134,,,,,,,,,,...,2.80139,186.280254,0.493455,,1.572774,0.578898,4.630887,1.735507,72.275658,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,32.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,93.075,-42.7,1.344,5099.1,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,1.1,93.749,-41.8,4.857,5191.0,
75%,47.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.961,5228.1,


In [150]:
# Summary statistics of each feature in the balanced (row-reduced) training data set
df_balanced.describe(include='all')

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
count,6667.0,6667,6667,6667,6667,6667,6667,6667,6667,6667,...,6667.0,6667.0,6667.0,6667,6667.0,6667.0,6667.0,6667.0,6667.0,6667
unique,,12,4,8,3,3,3,2,10,5,...,,,,3,,,,,,2
top,,admin.,married,university.degree,no,yes,no,cellular,may,thu,...,,,,nonexistent,,,,,,no
freq,,1814,3797,2116,5635,3539,5470,4713,1807,1443,...,,,,5222,,,,,,3571
mean,40.344083,,,,,,,,,,...,2.39583,894.291435,0.304935,,-0.437528,93.49303,-40.230418,3.030515,5138.534258,
std,11.919048,,,,,,,,,,...,2.550551,305.006734,0.680078,,1.723186,0.63442,5.281322,1.887306,86.70374,
min,17.0,,,,,,,,,,...,1.0,0.0,0.0,,-3.4,92.201,-50.8,0.634,4963.6,
25%,31.0,,,,,,,,,,...,1.0,999.0,0.0,,-1.8,92.893,-42.7,1.25,5076.2,
50%,38.0,,,,,,,,,,...,2.0,999.0,0.0,,-0.1,93.444,-41.8,4.076,5191.0,
75%,48.0,,,,,,,,,,...,3.0,999.0,0.0,,1.4,93.994,-36.4,4.959,5228.1,


In [151]:
print("Difference between the percentage of each 'job' value in the original data set and the balanced data set")
print(df.job.value_counts(normalize=True)-df_balanced.job.value_counts(normalize=True))

Difference between the percentage of each 'job' value in the original data set and the balanced data set
admin.          -0.018236
blue-collar      0.031577
entrepreneur     0.005588
housemaid        0.001337
management       0.003387
retired         -0.020046
self-employed   -0.001127
services         0.011948
student         -0.013558
technician       0.003368
unemployed      -0.003939
unknown         -0.000297
Name: job, dtype: float64


In [152]:
print("Difference between the percentage of each 'marital' value in the original data set and the balanced data set")
print(df.marital.value_counts(normalize=True)-df_balanced.marital.value_counts(normalize=True))

Difference between the percentage of each 'marital' value in the original data set and the balanced data set
married     0.036240
single     -0.036011
divorced    0.000213
unknown    -0.000443
Name: marital, dtype: float64


In [153]:
print("Difference between the percentage of each 'education' value in the original data set and the balanced data set")
print(df.education.value_counts(normalize=True)-df_balanced.education.value_counts(normalize=True))

Difference between the percentage of each 'education' value in the original data set and the balanced data set
basic.4y               0.004795
basic.6y               0.008064
basic.9y               0.018696
high.school           -0.005784
illiterate            -0.000393
professional.course    0.000730
university.degree     -0.022512
unknown               -0.003597
Name: education, dtype: float64


In [154]:
print("Difference between the percentage of each 'default' value in the original data set and the balanced data set")
print(df.default.value_counts(normalize=True)-df_balanced.default.value_counts(normalize=True))

Difference between the percentage of each 'default' value in the original data set and the balanced data set
no        -0.053144
unknown    0.053185
yes       -0.000041
Name: default, dtype: float64


In [155]:
print("Difference between the percentage of each 'housing' value in the original data set and the balanced data set")
print(df.housing.value_counts(normalize=True)-df_balanced.housing.value_counts(normalize=True))

Difference between the percentage of each 'housing' value in the original data set and the balanced data set
yes       -0.005620
no         0.006088
unknown   -0.000468
Name: housing, dtype: float64


In [156]:
print("Difference between the percentage of each 'loan' value in the original data set and the balanced data set")
print(df.loan.value_counts(normalize=True)-df_balanced.loan.value_counts(normalize=True))

Difference between the percentage of each 'loan' value in the original data set and the balanced data set
no         0.001393
yes       -0.000925
unknown   -0.000468
Name: loan, dtype: float64


In [157]:
print("Difference between the percentage of each 'contact' value in the original data set and the balanced data set")
print(df.contact.value_counts(normalize=True)-df_balanced.contact.value_counts(normalize=True))

Difference between the percentage of each 'contact' value in the original data set and the balanced data set
cellular    -0.071075
telephone    0.071075
Name: contact, dtype: float64


In [158]:
print("Difference between the percentage of each 'month' value in the original data set and the balanced data set")
print(df.month.value_counts(normalize=True)-df_balanced.month.value_counts(normalize=True))

Difference between the percentage of each 'month' value in the original data set and the balanced data set
may    0.063807
jul    0.007640
aug    0.006700
jun   -0.002483
nov    0.002585
apr   -0.015701
oct   -0.018920
sep   -0.018239
mar   -0.018850
dec   -0.006538
Name: month, dtype: float64


In [159]:
print("Difference between the percentage of each 'day_of_week' value in the original data set and the balanced data set")
print(df.day_of_week.value_counts(normalize=True)-df_balanced.day_of_week.value_counts(normalize=True))

Difference between the percentage of each 'day_of_week' value in the original data set and the balanced data set
fri    0.008690
mon    0.007302
thu   -0.006764
tue   -0.004661
wed   -0.004567
Name: day_of_week, dtype: float64


In [160]:
print("Difference between the percentage of each 'poutcome' value in the original data set and the balanced data set")
print(df.poutcome.value_counts(normalize=True)-df_balanced.poutcome.value_counts(normalize=True))

Difference between the percentage of each 'poutcome' value in the original data set and the balanced data set
nonexistent    0.079033
failure       -0.014882
success       -0.064151
Name: poutcome, dtype: float64


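With this approach, statistics such as the means, standard deviations and frequencies change more than they did with the random sampling used for the test data set. The largest shifts are in poutcome (about 8 percentage points for nonexistent), contact (about 7), month (about 6 for may) and default (about 5).
Despite that, the remaining statistical changes are not drastic.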
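
One plausible reason for the larger shifts is that removing only y=no rows changes the distribution of any feature that is associated with the target. The sketch below illustrates this on invented data in which `cellular` contacts answer yes more often than `telephone` contacts; the counts are assumptions made for the example and are not taken from the data set.

```python
import pandas as pd

# Invented data: 25% of 'cellular' rows are 'yes', but only 5% of 'telephone' rows.
toy = pd.DataFrame({
    "contact": ["cellular"] * 6000 + ["telephone"] * 4000,
    "y": ["yes"] * 1500 + ["no"] * 4500 + ["yes"] * 200 + ["no"] * 3800,
})

# Keep every 'yes' row and an equally sized random subset of the 'no' rows.
yes_rows = toy[toy["y"] == "yes"]
no_rows = toy[toy["y"] == "no"].sample(n=len(yes_rows), random_state=10)
balanced = pd.concat([yes_rows, no_rows])

before = toy["contact"].value_counts(normalize=True)
after = balanced["contact"].value_counts(normalize=True)
print(before["cellular"], after["cellular"])
```

The share of `cellular` rises after balancing because the retained yes rows are mostly cellular, which matches the direction of the contact shift seen in the output above. A feature unrelated to the target would be expected to keep roughly the same proportions.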

