## **Open Source Data Repositories**
1. [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php) - Small, manageable and standard datasets from almost all domains

2. [USA data.gov Initiative](https://www.data.gov/) - US government open-sourced data. Lots of untapped potential.

3. [World Bank Data](https://data.worldbank.org/) - Econometric, administrative and credit data for almost all countries. Low granularity (fewer data points), high latency (slow updates).

4. [Quandl](https://www.quandl.com/) - Financial and econometric data. High granularity, low latency (nightly updates).

5. [Kaggle Datasets](https://www.kaggle.com/datasets) - Pretty much everything. Good forum conversations.  

## **PIMA Indian Diabetes Data - UCI**
You can access the diabetes data and its documentation [here on the UCI website](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes).

In [2]:
import pandas as pd
import numpy as np

In [3]:
diabetes = pd.read_csv("pima_indians_diabetes.csv")
diabetes.head()

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0


**Note: the data file has no header row, so the first record was read as column names. Define the column names first and pass them to `read_csv` through the `names` argument.**

In [4]:
labels = ["Pregnant", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

diabetes = pd.read_csv("pima_indians_diabetes.csv",names=labels)
diabetes.head()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## **Data Exploration**

In [7]:
# Find data type for each attribute 
print("Data type of each attribute:")
diabetes.dtypes

Data type of each attribute:


Pregnant                      int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [8]:
# Generate statistical summary 
description = diabetes.describe()
print("Statistical summary of the data:\n")
description

Statistical summary of the data:



Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


#### **Check the prediction outcome**

In [10]:
class_counts = diabetes.groupby('Outcome').size() 
print("Class breakdown of the data:\n")
print(class_counts)

Class breakdown of the data:

Outcome
0    500
1    268
dtype: int64
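
The raw counts are easier to judge as proportions. The sketch below re-types the two counts from the output above instead of reading the CSV, so it runs on its own; the name `imbalance_ratio` is introduced here purely for illustration.

```python
import pandas as pd

# Class counts copied from the output above (index = Outcome value).
class_counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# Share of each class in the data set.
class_share = class_counts / class_counts.sum()

# Ratio of the larger class to the smaller one; values well above 1
# indicate an imbalanced target.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share.round(3))
print("Imbalance ratio:", round(imbalance_ratio, 2))
```

The share of class 1 is the same number that `describe()` reported as the mean of `Outcome` (0.348958).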


#### **Converting Outcome into categorical**

In [11]:
diabetes['Outcome'] = diabetes['Outcome'].astype('category')

In [14]:
diabetes.dtypes

Pregnant                       int64
Glucose                        int64
BloodPressure                  int64
SkinThickness                  int64
Insulin                        int64
BMI                          float64
DiabetesPedigreeFunction     float64
Age                            int64
Outcome                     category
dtype: object

##### Check summary statistics

In [12]:
diabetes.describe()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


In [13]:
diabetes["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [15]:
# Compute the correlation matrix.
# numeric_only=True skips the categorical Outcome column explicitly.
correlations = diabetes.corr(method='pearson', numeric_only=True)
print("Correlations of attributes in the data:\n") 
correlations

Correlations of attributes in the data:




Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Pregnant,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0
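
A full matrix is hard to scan, so one option is to rank the attribute pairs by absolute correlation. This is a minimal sketch: `strongest_pairs` is a hypothetical helper, and the small matrix is an excerpt of the output above re-typed by hand so the example runs without the CSV.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr: pd.DataFrame, top_n: int = 3) -> pd.Series:
    """Return the top_n attribute pairs ranked by absolute correlation."""
    # Keep the upper triangle only (k=1 skips the diagonal of 1.0s),
    # so every pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
    return ranked.head(top_n)

# Excerpt of the correlation matrix shown above.
cols = ["Pregnant", "SkinThickness", "Insulin", "Age"]
excerpt = pd.DataFrame(
    [[1.0, -0.081672, -0.073535, 0.544341],
     [-0.081672, 1.0, 0.436783, -0.11397],
     [-0.073535, 0.436783, 1.0, -0.042163],
     [0.544341, -0.11397, -0.042163, 1.0]],
    index=cols, columns=cols,
)
print(strongest_pairs(excerpt, top_n=2))
```

On the notebook's own data the call would be `strongest_pairs(correlations)`.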


#### **Check for Outliers**

In [18]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

fig, axs = plt.subplots()
sns.boxplot(data=diabetes,orient='h',palette="Set2")
plt.show()

[Figure: horizontal box plots of all attributes]

### **Dealing with Outliers**

**Reporting upper whisker for Insulin**

In [19]:
q75, q25 = np.percentile(diabetes["Insulin"], [75 ,25])

iqr = q75-q25

print("IQR",iqr)

whisker = q75 + (1.5*iqr)

print("Upper whisker",whisker)

IQR 127.25
Upper whisker 318.125
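
The same arithmetic can be wrapped in a small function that returns both fences. With the values above (Q1 = 0, Q3 = 127.25) the upper fence is 127.25 + 1.5 × 127.25 = 318.125. The helper name `iqr_bounds` and the toy values are assumptions for illustration, not part of the data set.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Toy values (hypothetical), with one obvious extreme value at the end.
sample = [0, 0, 30, 90, 130, 200, 846]
lower, upper = iqr_bounds(sample)
print("Lower fence:", lower)
print("Upper fence:", upper)

# Values outside the fences are the candidates for clipping.
outliers = [v for v in sample if v < lower or v > upper]
print("Outliers:", outliers)
```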


###### **Clip/Squash the values beyond a certain point**
**Here all values of Insulin greater than the upper whisker will be replaced with 318.125**

In [20]:
diabetes["Insulin"] = diabetes["Insulin"].clip(upper=whisker)

In [21]:
fig, axs = plt.subplots()
sns.boxplot(data=diabetes,orient='h',palette="Set2")
plt.show()

[Figure: horizontal box plots of all attributes after clipping Insulin]

#### Check missing values

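**Note: although there are no explicit missing values, the summary statistics show that several columns have a minimum of 0, which is not physically possible for these measurements. The counts below read 0 because this cell was re-run (In [36]) after the zeros had already been replaced; on the raw data they match the NaN counts reported further down.**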

In [36]:
print((diabetes.iloc[:,[1,2,3,4,5]] == 0).sum())

Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
dtype: int64


#### **Replacing 0 with NA**

In [23]:
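# Columns where a value of 0 is not physically meaningful.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0).
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)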
diabetes.head()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1


#### **Sum of NA values across each column**

In [24]:
diabetes.isnull().sum(axis=0)

Pregnant                      0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

#### Dealing with Missing Values

##### **A) Drop rows having NaN**
**This may lead to excessive loss of data. Use this method only when there are very few NA values.**

In [25]:
print("Size before dropping NaN rows",diabetes.shape,"\n")
nan_dropped = diabetes.dropna()
print("\nSize after dropping NaN rows",nan_dropped.shape)

Size before dropping NaN rows (768, 9) 


Size after dropping NaN rows (392, 9)


##### **B) Drop rows/columns having more than a certain percentage of NaNs (more sensible)**

In [26]:
diabetes.isnull().mean()

Pregnant                    0.000000
Glucose                     0.006510
BloodPressure               0.045573
SkinThickness               0.295573
Insulin                     0.486979
BMI                         0.014323
DiabetesPedigreeFunction    0.000000
Age                         0.000000
Outcome                     0.000000
dtype: float64

#### **Dropping rows and columns having greater than certain % NA**

In [27]:
print("Size before dropping NaN rows",diabetes.shape,"\n")
## Keep only columns with less than 30% NA values
col_dropped = diabetes.loc[:, diabetes.isnull().mean() < .3]
## Keep only rows with less than 30% NA values
row_dropped = diabetes.loc[diabetes.isnull().mean(axis=1) < .3, :]
print("\nSize after dropping columns",col_dropped.shape)
print("Size after dropping rows",row_dropped.shape)

Size before dropping NaN rows (768, 9) 


Size after dropping columns (768, 8)
Size after dropping rows (733, 9)
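
The two filters above can be folded into one helper. This is a sketch under the same rule as the cell above (keep anything whose NaN share is below the threshold); `drop_sparse` and the toy frame are hypothetical. With 9 columns, a row needs at least 3 NaNs (3/9 ≈ 0.33) before the 0.3 threshold removes it.

```python
import numpy as np
import pandas as pd

def drop_sparse(df: pd.DataFrame, max_na_share: float, axis: int = 1) -> pd.DataFrame:
    """Keep columns (axis=1) or rows (axis=0) whose NaN share is below max_na_share."""
    if axis == 1:
        keep = df.isnull().mean(axis=0) < max_na_share
        return df.loc[:, keep]
    keep = df.isnull().mean(axis=1) < max_na_share
    return df.loc[keep, :]

# Toy frame (hypothetical values) with one sparse column and one sparse row.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [np.nan, np.nan, 1.0, np.nan],
    "c": [5.0, 6.0, 7.0, 8.0],
})
kept_columns = drop_sparse(toy, 0.5, axis=1).columns.tolist()
kept_rows = drop_sparse(toy, 0.5, axis=0).index.tolist()
print(kept_columns)
print(kept_rows)
```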


##### **C) Impute missing values**
1. Some constant value that is considered "normal" in the domain
2. Summary statistic like Mean, Median, Mode
3. **A value estimated by algorithm or predictive model** - Will be taught later. Don't get ahead of yourself or you'll miss the real fun ;)
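
Options 1 and 2 need nothing beyond pandas. The sketch below uses a toy column with made-up values, and the constant 100 is purely illustrative, not a clinical reference value.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries.
toy = pd.DataFrame({"Glucose": [85.0, np.nan, 183.0, 89.0, np.nan]})

# 1. Constant: a value chosen from domain knowledge (100 is only an example).
constant_filled = toy["Glucose"].fillna(100.0)

# 2. Summary statistic: the median is less sensitive to extreme values
#    than the mean.
median_filled = toy["Glucose"].fillna(toy["Glucose"].median())

print(constant_filled.tolist())
print(median_filled.tolist())
```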

##### **Mean imputation**

In [29]:
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy="mean")
# Phase 1: learn the column means.
mean_imputer.fit(diabetes)
# Phase 2: apply the learned means (transform, not a second fit).
imputed_diabetes = pd.DataFrame(mean_imputer.transform(diabetes), columns=labels)

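**Important take-away: transformations should be applied in two phases. `fit` learns the parameters (here, the column means) and `transform` applies them. Fit on the training data only, then reuse the fitted transformer on any new data.**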
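
The two phases matter most once the data is split. A minimal sketch, assuming a train/test split that this notebook does not perform; the arrays are toy values and the names `X_train` and `X_test` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays (hypothetical values); in practice these come from a train/test split.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")

# Phase 1: learn the mean from the training data only.
imputer.fit(X_train)

# Phase 2: apply that same mean to both sets.
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)

print("Learned mean:", imputer.statistics_)
print("Test set after imputation:", X_test_filled.ravel())
```

The missing test value is filled with the training mean, so no information from the test set leaks into the transformation.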

#### Range Scaling

In [30]:
from sklearn.preprocessing import MinMaxScaler
range_scaler = MinMaxScaler()
# Phase 1: learn each column's min and max.
range_scaler.fit(imputed_diabetes)
# Phase 2: rescale using the learned min and max.
range_scaled_diabetes = pd.DataFrame(range_scaler.transform(imputed_diabetes), columns=labels)

**Notice the summary statistics after range scaling: every column now has a minimum of 0 and a maximum of 1.**

In [31]:
range_scaled_diabetes.describe()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,0.22618,0.501205,0.49393,0.240798,0.426031,0.291564,0.168179,0.204015,0.348958
std,0.19821,0.196361,0.123432,0.095554,0.197299,0.140596,0.141473,0.196004,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.058824,0.359677,0.408163,0.195652,0.353473,0.190184,0.070773,0.05,0.0
50%,0.176471,0.470968,0.491863,0.240798,0.426031,0.290389,0.125747,0.133333,0.0
75%,0.352941,0.620968,0.571429,0.271739,0.426031,0.376278,0.234095,0.333333,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
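
Range scaling maps each value to (x - min) / (max - min), computed per column. For example, `Pregnant` has min 0 and max 17, so its raw mean of 3.845052 becomes 3.845052 / 17 ≈ 0.22618, as in the table above. The sketch below does the same by hand on a toy column with made-up values.

```python
import pandas as pd

# Toy column (hypothetical values).
toy = pd.DataFrame({"Age": [21.0, 33.0, 51.0, 81.0]})

# x_scaled = (x - min) / (max - min), computed column by column.
col_min = toy.min()
col_max = toy.max()
scaled = (toy - col_min) / (col_max - col_min)

print(scaled["Age"].tolist())
```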


####  Standardization

In [32]:
from sklearn.preprocessing import StandardScaler
standardizer = StandardScaler()
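# Phase 1: learn each column's mean and standard deviation.
standardizer.fit(imputed_diabetes)
# Phase 2: standardize using the learned statistics.
std_diabetes = pd.DataFrame(standardizer.transform(imputed_diabetes), columns=labels)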

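**Notice the summary statistics after standardization: each column has a mean of approximately 0 and a standard deviation of approximately 1. The table shows 1.000652 rather than exactly 1 because `describe()` reports the sample standard deviation (n - 1 denominator), while `StandardScaler` divides by the population standard deviation (n denominator).**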

In [33]:
std_diabetes.describe()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,-6.476301e-17,-3.561966e-16,6.915764e-16,7.956598e-16,6.707597000000001e-17,3.515706e-16,2.451743e-16,1.931325e-16,7.401487e-17
std,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652,1.000652
min,-1.141852,-2.554131,-4.004245,-2.52167,-2.160728,-2.075119,-1.189553,-1.041549,-0.7321202
25%,-0.8448851,-0.7212214,-0.695306,-0.4727737,-0.3679958,-0.7215397,-0.6889685,-0.7862862,-0.7321202
50%,-0.2509521,-0.1540881,-0.01675912,8.087936e-16,0.0,-0.008363615,-0.3001282,-0.3608474,-0.7321202
75%,0.6399473,0.610309,0.6282695,0.3240194,0.0,0.6029301,0.4662269,0.6602056,1.365896
max,3.906578,2.54185,4.102655,7.950467,2.911036,5.042087,5.883565,4.063716,1.365896
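
Standardization computes z = (x - mean) / std per column. The sketch below repeats it by hand on a toy array with made-up values and shows where the small gap between the two standard-deviation conventions comes from.

```python
import numpy as np

# Toy column (hypothetical values).
x = np.array([21.0, 33.0, 51.0, 81.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

print("Population std of z:", z.std(ddof=0))
print("Sample std of z:    ", z.std(ddof=1))

# With n rows the two differ by a factor of sqrt(n / (n - 1)).
# For the 768 rows of this data set:
ratio = np.sqrt(768 / 767)
print("sqrt(768 / 767) =", ratio)
```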


##### Binning 
**As you go on, you will find that numeric variables are usually the most convenient form for data science. Sometimes, however, for the sake of explaining results to a client, you may want to convert a numeric variable into a handful of classes. This can be done using binning.**

*You will learn better strategies for grouping values based on similar behaviour. Binning is very simple compared with the more advanced techniques you will learn later.*

In [34]:
bins = [0,25,30,35,40,100]

group_names = ['Malnutritioned', 'Under-Weight', 'Healthy', 'Over-Wight',"Obese"]
diabetes['BMI Class'] = pd.cut(diabetes['BMI'], bins, labels=group_names)
diabetes.head(3)

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,BMI Class
0,6,148.0,72.0,35.0,,33.6,0.627,50,1,Healthy
1,1,85.0,66.0,29.0,,26.6,0.351,31,0,Under-Weight
2,8,183.0,64.0,,,23.3,0.672,32,1,Malnutritioned
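
It helps to know how `pd.cut` treats values that sit on a bin edge. By default the intervals are closed on the right, so the bins above are (0, 25], (25, 30] and so on. The sketch below reuses the bin edges and label spellings from the cell above; the BMI values are hypothetical and chosen to land on or near the edges.

```python
import numpy as np
import pandas as pd

bins = [0, 25, 30, 35, 40, 100]
group_names = ['Malnutritioned', 'Under-Weight', 'Healthy', 'Over-Wight', 'Obese']

# Hypothetical BMI values on or near the bin edges.
bmi = pd.Series([0.0, 25.0, 25.1, 100.0, np.nan])
bmi_class = pd.cut(bmi, bins, labels=group_names)

print(bmi_class.tolist())
```

A value of exactly 0 falls outside the first interval and becomes NaN, as does a missing value. In this notebook the zero BMI values had already been replaced with NaN before binning, so the result is the same either way.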


##### Dummification
**This is used to encode multi-level columns (e.g. Malnutritioned, Under-Weight, Healthy) as an explicit binary encoding (1 or 0 for each level, denoting presence or absence).**

*Let's just say this is a cleaner way of representing features, and the computer likes it this way.*

In [35]:
# dtype=int gives 0/1 columns (recent pandas versions return True/False by default).
dummified_data = pd.concat([diabetes.iloc[:,:-1], pd.get_dummies(diabetes['BMI Class'], dtype=int)], axis=1)
dummified_data.head()

Unnamed: 0,Pregnant,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Malnutritioned,Under-Weight,Healthy,Over-Wight,Obese
0,6,148.0,72.0,35.0,,33.6,0.627,50,1,0,0,1,0,0
1,1,85.0,66.0,29.0,,26.6,0.351,31,0,0,1,0,0,0
2,8,183.0,64.0,,,23.3,0.672,32,1,1,0,0,0,0
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0,0,1,0,0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1,0,0,0,0,1


**Notice that we now have separate columns for Malnutritioned, Under-Weight, Healthy and so on, each holding 0 or 1.**
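
An alternative to concatenating by hand is to let `pd.get_dummies` replace the column itself through its `columns` argument. This is a sketch on a toy frame with made-up values; note that the generated column names carry the original column name as a prefix.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column.
toy = pd.DataFrame({
    "Age": [50, 31, 32],
    "BMI Class": pd.Categorical(
        ["Healthy", "Under-Weight", "Healthy"],
        categories=["Under-Weight", "Healthy", "Obese"],
    ),
})

# columns=[...] encodes the named column and drops the original;
# dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["BMI Class"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Because the column is categorical, a level that never occurs (here `Obese`) still gets its own all-zero column, which keeps the layout stable across different subsets of the data.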