## Data Preprocessing

Data refers to examples or cases from the domain that characterize the problem you want to solve. 
 
     In supervised learning, data is composed of examples where each example has an input element that 
     will be provided to a model and an output or target element that the model is expected to predict.
 
 Input Variables: Columns in the dataset provided to a model in order to make a prediction.
 
 Output Variable: Column in the dataset to be predicted by a model.

Data collected from your domain is referred to as raw data and is collected in the context of a problem you want to solve.


<b>Machine Learning Algorithms Expect Numbers</b>

       Most ML algorithms (linear regression, decision trees, SVMs, neural networks, etc.) work with numerical values.

       Categorical, textual, or image data must be converted into a numerical form 
                  (e.g., one-hot encoding, embeddings, pixel arrays).

       Scaling or normalizing numeric features is often needed as well, so that features lie on similar ranges.
          
<b>Machine Learning Algorithms Have Requirements</b>

    Every algorithm comes with assumptions or constraints.

        Linear Regression assumes linearity, no multicollinearity, and normally distributed residuals.

        SVM works best when features are scaled.

        KNN relies on distance metrics, so features must be numeric and ideally normalized.

    Violating these requirements often leads to poor performance.
    
<b>Model Performance Depends on Data</b>

    "Garbage in, garbage out" — if your data is noisy, missing, biased, or irrelevant, no algorithm will save the model.

    More high-quality and representative data usually beats tweaking the algorithm endlessly.

    Features should capture the patterns in the problem domain.

<b>Predictive Modeling Is Mostly Data Preparation</b>

    A commonly cited rule of thumb is that around 70–80% of ML project time goes to cleaning, preprocessing, and feature engineering.

    Steps include: handling missing values, encoding categorical data, normalization/standardization, 
    dealing with imbalanced data, and feature selection.

    A well-prepared dataset often allows even simple models to perform surprisingly well.
 

##### The algorithm matters, but the quality, representation, and preparation of data matter much more.

### Data Generation

In [None]:
import numpy as np
import pandas as pd

In [9]:
import warnings
warnings.filterwarnings('ignore')

The ranges of these attributes are as follows:
1.	Employee ID       : 1-100
2.	Age               : 25-62
3.	Basic Pay         : 15,600-67,000
4.	No. of Clients    : 1-1,000
5.	Years of Service  : 0-40
6.	Performance Score : 0/1

In [None]:
data_employee={ 'employee_id':np.arange(1,101),
                'Age':np.random.randint(25,63,size=100),
                # np.random.randint excludes the upper bound, so add 1 to include 67000 and 1000
                'Basic Pay':np.random.randint(15600,67001,size=100),
                'No of Clients':np.random.randint(1,1001,size=100),
                'Years of Service':np.random.randint(0,41,size=100),
                'Performance Score':np.random.randint(0,2,size=100)
              }
df=pd.DataFrame(data_employee,columns=['employee_id','Age','Basic Pay',
                                       'No of Clients','Years of Service',
                                       'Performance Score'])            
df

In [None]:
df.to_csv('data/emp.csv',sep=',',index=False)

In [None]:
e=pd.read_csv('data/emp.csv')

In [None]:
e.head(5)

In [None]:
e1=pd.read_csv('data/data_after_missing_values.csv')
e1.head(5)

In [None]:
e1.isna().sum()

In [None]:
e1.count()

In [None]:
e1.info()

In [None]:
e1['Age'].mean()

In [None]:
e1['Basic Pay'].median()

##  Data Imputation or Missing Values

### Data Imputation:
      Dealing with missing values
        
    1. If the data is quantitative (discrete or continuous),
              use a statistical measure such as the mean or median.
              
    2. If the data is qualitative / categorical, use the mode.
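
As a rough sketch of the two rules above, the cell below uses scikit-learn's `SimpleImputer` on a small made-up frame. The frame `toy` and its values are assumptions for illustration, not the employee data: the median fills the numeric column and the mode fills the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'Age': [25.0, 30.0, np.nan, 45.0],
    'DayType': ['Weekday', 'Weekend', 'Weekday', np.nan],
})

# Quantitative column: fill with the median (the mean is the other common choice)
num_imputer = SimpleImputer(strategy='median')
toy['Age'] = num_imputer.fit_transform(toy[['Age']]).ravel()

# Qualitative column: fill with the most frequent value (the mode)
cat_imputer = SimpleImputer(strategy='most_frequent')
toy['DayType'] = cat_imputer.fit_transform(toy[['DayType']]).ravel()

print(toy)
```

A fitted imputer remembers the statistic it learned, so the same fill value can later be applied to new data with `transform`.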

In [None]:
# Impute each column with a statistic computed from e1 itself (NaN values are skipped)
e1 = e1.fillna({
    'Age': e1['Age'].mean(),
    'Basic Pay': e1['Basic Pay'].mean(),
    'No of Clients': e1['No of Clients'].mean(),
    'Years of Service': e1['Years of Service'].median()
})

In [None]:
print(e1)

In [None]:
e1.to_csv('data/empmean.csv',sep=',',index=False)

In [None]:
e1=pd.read_csv('data/empmean.csv')
e1.head(5)

In [None]:
import pandas as pd
import random

zones = ['Downtown', 'Airport', 'Tech Park', 'Suburbs', 'Stadium']
time_slots = ['Morning', 'Afternoon', 'Evening', 'Night']
day_types = ['Weekday', 'Weekend']

data = []

for i in range(10):
    zone = random.choice(zones)
    time = random.choice(time_slots)
    day_type = random.choice(day_types)

    data.append([zone, time, day_type])

In [None]:
df = pd.DataFrame(data, columns=['Zone', 'Time', 'DayType'])
df

In [None]:
df.to_csv('data/taxi.csv',sep=',',index=False)

In [None]:
df=pd.read_csv('data/taxi.csv')
df

In [None]:
df['DayType'].mode()[0]

In [None]:

df['DayType'] = df['DayType'].fillna(df['DayType'].mode()[0])

print(df)

### Imputation with per-class medians read from a boxplot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv('data/titanic_train.csv')
train

In [None]:
train.isnull().sum()

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

In [None]:
train['Survived'].value_counts()

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train)
plt.show()

In [None]:
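def impute_age(cols):
    # Access by column label; positional indexing such as cols[0] is deprecated on a label-indexed Series
    Age = cols['Age']
    Pclass = cols['Pclass']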
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

In [None]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
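
The constants 37, 29 and 24 in `impute_age` are fixed by hand. A possible alternative, sketched here on a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic file), is to let pandas compute each class's median age and fill the gaps with it:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; three ages are missing
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age': [38.0, 36.0, np.nan, 30.0, np.nan, 22.0, 26.0, np.nan],
})

# Median age per passenger class, broadcast back to every row of that class
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Only rows with a missing Age receive their class median
toy['Age'] = toy['Age'].fillna(class_median)
print(toy)
```

This keeps the imputed values in step with the data if the file changes, at the cost of hiding the numbers that the hard-coded version makes explicit.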

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

In [None]:
train.drop('Cabin',axis=1,inplace=True)

In [None]:
train

In [None]:
train.isnull().sum()

### Data Encoding

### Most machine learning libraries do not accept categorical (qualitative) data directly
 
 So we convert qualitative values to quantitative ones:
 
 1. Pandas: manual mapping with `replace()` or `map()`, practical when there are only a few class labels
 2. Scikit-learn `LabelEncoder`
 3. One-hot encoding: the disadvantage is that it creates one new column per class label, so the number of columns grows with the number of labels
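
One practical point the list above leaves open is how to apply the same encoding to data that arrives later. A minimal sketch, assuming scikit-learn's `OneHotEncoder` and two made-up frames (`train_df` and `new_df` are hypothetical): the encoder learns the categories once and then encodes new rows consistently, with unseen labels mapped to all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({'Workclass': ['Private', 'State-gov', 'Others']})
new_df = pd.DataFrame({'Workclass': ['Private', 'Self-emp']})  # 'Self-emp' was never seen

# handle_unknown='ignore' encodes unseen categories as a row of zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_df[['Workclass']])

# The default output is a sparse matrix; convert it to a dense array for display
encoded_new = encoder.transform(new_df[['Workclass']]).toarray()

print(encoder.categories_[0])  # categories learned from train_df, in sorted order
print(encoded_new)
```

`pd.get_dummies` has no memory of the categories it saw, so two frames with different labels can end up with different columns; a fitted encoder avoids that.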


In [None]:
# Pandas Encoding

In [4]:
import pandas as pd

df = pd.read_csv('data/scaling test1.csv')
df


     ID  Age    Workclass
0  1001   25      Private
1  1002   38      Private
2  1003   28    State-gov
3  1004   36  Central-gov
4  1005   20       Others


In [7]:
df.Workclass.unique()

array(['Private', 'State-gov', 'Central-gov', 'Others'], dtype=object)

In [6]:

df['Workclass'] = df['Workclass'].str.strip()

In [10]:

workclass_mapping = {
    "Workclass": {
        "Private": 1,
        "State-gov": 2,
        "Central-gov": 3,
        "Others": 4
    }
}

df.replace(workclass_mapping, inplace=True)

print(df)

     ID  Age  Workclass
0  1001   25          1
1  1002   38          1
2  1003   28          2
3  1004   36          3
4  1005   20          4


# map()

In [11]:
import pandas as pd

df = pd.read_csv('data/scaling test1.csv')

df['Workclass'] = df['Workclass'].str.strip()

workclass_mapping = {
    "Private": 1,
    "State-gov": 2,
    "Central-gov": 3,
    "Others": 4
}

df['Workclass'] = df['Workclass'].map(workclass_mapping)

print(df)

     ID  Age  Workclass
0  1001   25          1
1  1002   38          1
2  1003   28          2
3  1004   36          3
4  1005   20          4


In [None]:
# Label Encoder

In [12]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('data/scaling test1.csv')

df_encoded = df.copy()

le = LabelEncoder()

df_encoded['Workclass_encoded'] = le.fit_transform(df['Workclass'])

print(df_encoded)

     ID  Age    Workclass  Workclass_encoded
0  1001   25      Private                  2
1  1002   38      Private                  2
2  1003   28    State-gov                  3
3  1004   36  Central-gov                  0
4  1005   20       Others                  1


In [None]:
# one hot encoding

In [13]:
import pandas as pd

df = pd.read_csv('data/scaling test1.csv')

df_dummies = pd.get_dummies(df, columns=['Workclass'], prefix='Workclass')

print(df_dummies)

     ID  Age  Workclass_Central-gov  Workclass_Others  Workclass_Private  \
0  1001   25                  False             False               True   
1  1002   38                  False             False               True   
2  1003   28                  False             False              False   
3  1004   36                   True             False              False   
4  1005   20                  False              True              False   

   Workclass_State-gov  
0                False  
1                False  
2                 True  
3                False  
4                False  


In [14]:

df_dummies = pd.get_dummies(df, columns=['Workclass'], prefix='Workclass', dtype=int)

print(df_dummies)


     ID  Age  Workclass_Central-gov  Workclass_Others  Workclass_Private  \
0  1001   25                      0                 0                  1   
1  1002   38                      0                 0                  1   
2  1003   28                      0                 0                  0   
3  1004   36                      1                 0                  0   
4  1005   20                      0                 1                  0   

   Workclass_State-gov  
0                    0  
1                    0  
2                    1  
3                    0  
4                    0  


### Outlier Analysis
An outlier is a data point in a data set that is distant from all other observations. 

In other words, it is a data point that lies outside the overall distribution of the dataset.

In [None]:
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt

In [None]:
df = pd.read_csv("data/heights.csv")
df

In [None]:
upper_limit = df.height.mean() + 3*df.height.std()
upper_limit

In [None]:
lower_limit = df.height.mean() -3*df.height.std()
lower_limit

In [None]:
df[(df.height>upper_limit) | (df.height<lower_limit)]

In [None]:
import numpy as np
import pandas as pd

data = np.random.normal(0, 1, 1000)

print("25th percentile (Q1):", np.quantile(data, 0.25))
print("Median (Q2):", np.median(data))
print("75th percentile (Q3):", np.quantile(data, 0.75))


In [None]:
df = pd.read_csv("data/heights.csv")
df

In [None]:
max_threshold = df['height'].quantile(0.95)
max_threshold

In [None]:
df[df['height']>max_threshold]

In [None]:
min_threshold = df['height'].quantile(0.05)
min_threshold

In [None]:
df[df['height']<min_threshold]

In [None]:
# Keep the rows that were not flagged above (values equal to a threshold are kept)
df[(df['height']<=max_threshold) & (df['height']>=min_threshold)]

In [None]:
import pandas as pd
df = pd.read_csv("data/heights.csv")
df

In [None]:
Q1 = df.height.quantile(0.25)
Q3 = df.height.quantile(0.75)
Q1, Q3

In [None]:
IQR = Q3 - Q1
IQR

In [None]:
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit

In [None]:
df[(df.height<lower_limit)|(df.height>upper_limit)]

In [None]:
# Keep every row that was not flagged above, including values exactly on a limit
df_no_outlier = df[(df.height>=lower_limit)&(df.height<=upper_limit)]
df_no_outlier
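
To compare the two rules used above side by side, the sketch below applies the 3-sigma rule and the 1.5*IQR rule to a small invented sample. The values in `heights` are assumptions for illustration, not the contents of the heights file.

```python
import pandas as pd

heights = pd.Series([60, 62, 63, 64, 65, 66, 67, 68, 70, 95])  # made-up sample

# Rule 1: more than 3 standard deviations away from the mean
mean, std = heights.mean(), heights.std()
z_outliers = heights[(heights - mean).abs() > 3 * std]

# Rule 2: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
iqr_outliers = heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)]

print("3-sigma rule flags:", z_outliers.tolist())
print("IQR rule flags:", iqr_outliers.tolist())
```

In this sample the extreme value inflates the standard deviation enough to hide itself from the 3-sigma rule, while the quartiles barely move, so only the IQR rule flags it. On small samples the two rules can therefore disagree.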

### Data Normalization/Scaling/Transformation

Normalization is a critical step in data preprocessing that rescales numerical features to a standard range.

Why Normalization Matters

1. Algorithm Performance: Many machine learning algorithms perform better when features are on similar scales

        Distance-based algorithms (k-NN, SVM, k-means) are sensitive to feature scales

        Gradient descent converges faster when features are normalized

2. Feature Dominance Prevention: Prevents features with larger ranges from dominating those with smaller ranges

        Example: In a dataset with age (0-100) and income (0-1,000,000), income would dominate distance calculations without normalization

3. Regularization Impact: Regularization terms in models like Ridge/Lasso are affected by feature scales

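    The penalty acts on coefficient size, which depends on a feature's units: a feature measured on a small scale needs a large coefficient and is penalized more, so unscaled features are penalized unevenly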

Normalization often leads to better model performance, faster convergence, and more meaningful feature comparisons. 
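
The feature-dominance point can be checked with a few lines of arithmetic. The sketch below uses three invented people (the ages and incomes are assumptions) and compares Euclidean distances before and after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up people: [age in years, income]
people = np.array([
    [25, 50_000],
    [60, 51_000],
    [26, 90_000],
], dtype=float)

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

# Raw features: income differences swamp the 35-year age gap
raw_01 = dist(people[0], people[1])
raw_02 = dist(people[0], people[2])

# After min-max scaling both features lie in [0, 1]
scaled = MinMaxScaler().fit_transform(people)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.3f}  d(0,2)={scaled_02:.3f}")
```

On the raw features person 1 looks roughly 40 times closer to person 0 than person 2 does, purely because income is measured in larger units. After scaling, the 35-year age gap and the 40,000 income gap count about equally.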

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])
stats.zscore(data)

In [None]:
data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])
stats.zscore(data)

In [None]:
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
data

In [None]:
data.apply(stats.zscore)

![normalization1.png](attachment:normalization1.png)

![minmax.png](attachment:minmax.png)

![zscore.png](attachment:zscore.png)

![logscaling.png](attachment:logscaling.png)

In [None]:
from sklearn.preprocessing import MinMaxScaler

data = [[3,2],
        [15,6],
        [0,10],
        [1,18]]

scaler = MinMaxScaler()

In [None]:
print(scaler.fit_transform(data))

In [None]:
from sklearn.preprocessing import StandardScaler

data = [[3,2],[15,6],[0,10],[1,18]]

scaler = StandardScaler()

print(scaler.fit_transform(data))


In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p)

data = [[3,2],[15,6],[0,10],[1,18]]

transformer.transform(data)

#### Min-max normalization is a common choice when data doesn’t follow a Gaussian (normal) distribution.

It suits algorithms that make no assumption about the distribution of the data, such as KNN and neural networks.

Note that min-max normalization is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses every other value into a narrow band.

##### Standardization can be helpful in cases where data follows a Gaussian distribution.

However, this doesn’t necessarily have to be true.

In addition, unlike normalization, standardization doesn’t have a bounding range.

This means outliers are not squeezed into a fixed interval, although they still influence the mean and standard deviation used for scaling.

##### Log scaling is preferable when a feature is heavily right-skewed or holds huge outliers; `log1p`, as used above, requires values greater than -1.
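
To make the outlier remarks concrete, here is a small sketch that pushes one made-up feature (four ordinary values and one extreme one, chosen only for illustration) through the three transformations discussed above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
zscore = StandardScaler().fit_transform(values).ravel()
logged = np.log1p(values).ravel()

print("min-max:", np.round(minmax, 4))
print("z-score:", np.round(zscore, 4))
print("log1p:  ", np.round(logged, 4))
```

Min-max squeezes the four ordinary values into a sliver near 0, and the z-scores of those values also end up almost identical because the outlier dominates the mean and standard deviation. The log transform keeps the ordinary values clearly separated while pulling the extreme one much closer.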

### 1. Min-Max normalization 

In [None]:
import pandas as pd

X_train=pd.read_csv('data/X_train.csv')

Y_train=pd.read_csv('data/Y_train.csv')

X_test=pd.read_csv('data/X_test.csv')

Y_test=pd.read_csv('data/Y_test.csv')

print (X_train.head())
print (Y_train.head())

In [None]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History']],Y_train)
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test,knn.predict(X_test[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History']])))

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
# MinMaxScaler rescales each feature to the range [0, 1]:
#     X_scaled = (X - X_min) / (X_max - X_min)

from sklearn.preprocessing import MinMaxScaler

min_max=MinMaxScaler()

X_train_minmax=min_max.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

print(X_train_minmax)

# Reuse the minimum and maximum learned from the training data; do not refit on the test set
X_test_minmax=min_max.transform(X_test[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

knn=KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train_minmax,Y_train)

print(accuracy_score(Y_test,knn.predict(X_test_minmax)))

print('standardized')

Distance algorithms like KNN, K-means, and SVM use distances between data points to determine their similarity. 

They are the algorithms most affected by the range of the features.

Models such as linear regression and logistic regression are often fitted with gradient descent, which converges poorly when features are on very different scales.

Having features on a similar scale helps gradient descent converge more quickly towards the minimum.

On the other hand, tree-based algorithms are not sensitive to the scale of the features. 

This is because a decision tree only splits a node based on a single feature, and this split is not influenced by other features.
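
The claim about trees can be checked empirically. The sketch below is only an illustration on scikit-learn's bundled iris data, not on the loan files used above: it trains the same decision tree on raw and on standardized features and compares the predictions.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same tree settings, trained once on raw and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Standardization keeps the ordering of values within each feature,
# so every candidate split partitions the samples in the same way
same = (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all()
print("Identical predictions:", same)
```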

### 2. Z-score normalization

In [None]:
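from sklearn.preprocessing import StandardScaler

# Fit on the training data only, then apply the same mean and standard deviation to the test data
std_scaler=StandardScaler()

X_train_scale=std_scaler.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

X_test_scale=std_scaler.transform(X_test[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History']])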

# Fitting logistic regression on our standardized data set

from sklearn.linear_model import LogisticRegression

log=LogisticRegression(penalty='l2',C=.01)

log.fit(X_train_scale,Y_train)
# Checking the model's accuracy

accuracy_score(Y_test,log.predict(X_test_scale))
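
One way to keep scaling and model fitting together is a scikit-learn pipeline. The sketch below runs on synthetic data from `make_classification`, which is an assumed stand-in for the loan files; the names `X_demo`, `y_demo` and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training split only and reuses
# those statistics when it scores the test split
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.01))
pipe.fit(X_tr, y_tr)

print("Test accuracy:", pipe.score(X_te, y_te))
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation helpers without leaking test-set statistics into training.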

###  Dimensionality reduction

![dm.png](attachment:dm.png)

### PCA

In [None]:
import pandas as pd
url = "data/iris.csv"

df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])
df

In [None]:
from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values
x

In [None]:
# Separating out the target
y = df.loc[:,['target']].values
print(y)

In [None]:
# Standardizing the features
x = StandardScaler().fit_transform(x)
print(x)

In [None]:
# The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 3 dimensions.
from sklearn.decomposition import PCA

pca = PCA(n_components=3)

principalComponents = pca.fit_transform(x)

print(principalComponents)

In [None]:
principalDf = pd.DataFrame(data = principalComponents, columns = ['pc1','pc2','pc3'])
print(principalDf)

In [None]:
finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
print(finalDf)

In [None]:
pca.explained_variance_ratio_

In [None]:

import numpy as np

exp_var_cum=np.cumsum(pca.explained_variance_ratio_)
print(exp_var_cum)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.step(range(exp_var_cum.size), exp_var_cum)

In [None]:
pd.DataFrame({'Components':['C1', 'C2','C3'],'Variance':pca.explained_variance_ratio_}).plot.bar(x='Components',y='Variance',rot=0);

<b>explained_variance_ratio_</b> gives the fraction of the total variance captured by each principal component; its cumulative sum helps decide how many components retain enough of the information in the original features.

PCA centers the data before decomposing it, which turns a sparse matrix into a dense one, so it does not scale well to high-dimensional sparse data and can run out of memory.

In such cases, TruncatedSVD, which does not center the data, can be used.
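
Instead of reading the number of components off the cumulative plot, PCA can be asked for a target fraction of variance directly. A minimal sketch, assuming the iris data bundled with scikit-learn (`load_iris`) rather than `data/iris.csv`, and an arbitrary 95% target:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) asks PCA for the fewest components whose cumulative
# explained variance reaches that fraction
pca_95 = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca_95.fit_transform(X_iris)

print("Components kept:", pca_95.n_components_)
print("Cumulative variance:", np.cumsum(pca_95.explained_variance_ratio_))
```

The 95% figure is a convention rather than a rule; the right cut-off depends on how much information loss the downstream model tolerates.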

### SVD

In [None]:
# Separating out the features
x = df.loc[:, features].values
x

In [None]:
# Separating out the target
y = df.loc[:,['target']].values
y

In [None]:
# Standardizing the features
x = StandardScaler().fit_transform(x)
x

In [None]:
from sklearn.decomposition import TruncatedSVD 

svd = TruncatedSVD(n_components=3) 
svd.fit_transform(x); 

pd.DataFrame({'Components':['C1','C2', 'C3'],'Variance':svd.explained_variance_ratio_}).plot.bar(x='Components',y='Variance',rot=0);

In [None]:
svd.explained_variance_ratio_

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

exp_var_cum=np.cumsum(svd.explained_variance_ratio_)
print(exp_var_cum)


### LDA

In [None]:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values
print(x)

y = df.loc[:,['target']].values

In [None]:
lda = LinearDiscriminantAnalysis(n_components=2)

# Fit the LDA model and transform the original feature space; y.ravel() makes y 1-D, as LDA expects
X_lda = lda.fit(x, y.ravel()).transform(x)

#  Print the reduced number of features 

print('Reduced number of features:', X_lda.shape[1])

# Show the proportion of class-separating variance captured by each component

print('Explained variance ratio:', lda.explained_variance_ratio_)

# Print the number of features in the original dataset

print('Original number of features:', x.shape[1])


# Feature Selection

#### SelectKBest on load_iris

In [None]:

from sklearn.datasets import load_iris

from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

X, y = iris.data, iris.target

selector = SelectKBest(chi2, k=3)

X_new = selector.fit_transform(X, y)


In [None]:

selected_mask = selector.get_support()  

selected_mask


In [None]:

selected_indices = selector.get_support(indices=True) 
selected_indices


In [None]:

selected_scores = selector.scores_


In [None]:
print("Feature scores:", selected_scores)


In [None]:
print("Selected feature names:")

for i in selected_indices:
    print(f"- {iris.feature_names[i]}")
    

### SelectPercentile on iris data

In [None]:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_iris(return_X_y=True)
print("Original shape:", X.shape)  # Expected: (150, 4)

X_new = SelectPercentile(chi2, percentile=50).fit_transform(X, y)
print("Reduced shape:", X_new.shape)  


In [None]:

selector = SelectPercentile(chi2, percentile=50).fit(X, y)

print("Chi2 scores:", selector.scores_)
print("Selected mask:", selector.get_support())


### GenericUnivariateSelect

In [None]:

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X, y = load_iris(return_X_y=True)
print("Original shape:", X.shape)  

# Apply GenericUnivariateSelect with chi2 and mode='k_best', param=2
# You can adjust 'mode' and 'param' as needed

selector = GenericUnivariateSelect(score_func=chi2, mode='k_best', param=2)
X_new = selector.fit_transform(X, y)

print("Reduced shape:", X_new.shape)  


In [None]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print("Original shape:", X.shape)  

# Train a Linear SVM with L1 regularization
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=2000).fit(X, y)

# Use SelectFromModel to pick features with non-zero weights
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)

print("Reduced shape:", X_new.shape)


In [None]:

selected_mask = model.get_support()
feature_names = iris.feature_names

print("Selected features:")
for i, selected in enumerate(selected_mask):
    if selected:
        print(f"- {feature_names[i]}")


### Mutual Information

Mutual Information measures how much knowing one variable reduces uncertainty about another.

In feature selection, it tells us how much a feature tells us about the target variable.
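
For discrete variables the definition can be checked by hand. The sketch below uses `sklearn.metrics.mutual_info_score` on three tiny made-up label lists (all values are assumptions for illustration); `mutual_info_classif`, used in the next cell, estimates the same quantity for continuous features.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]
copy_of_target = [0, 0, 1, 1]   # knowing this feature fixes the target
unrelated = [0, 1, 0, 1]        # each feature value sees both targets equally often

mi_copy = mutual_info_score(target, copy_of_target)
mi_unrelated = mutual_info_score(target, unrelated)

# Scores are in nats; a balanced binary target has entropy ln(2)
print("MI with a copy of the target:", mi_copy)
print("MI with an unrelated feature:", mi_unrelated)
print("ln(2) for reference:", np.log(2))
```

A feature that fully determines the target removes all of its uncertainty, so its mutual information equals the target's entropy, while an unrelated feature scores zero.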

In [None]:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target
print("Original shape:", X.shape)

# Select top 3 features based on Mutual Information
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_new = selector.fit_transform(X, y)

selected_indices = selector.get_support(indices=True)
scores = selector.scores_

print("Reduced shape:", X_new.shape)
print("Selected feature indices:", selected_indices)
print("Feature scores:", scores)


### Variance Threshold 

A Variance Threshold is the simplest form of feature selection:
it removes features whose variance is below a given threshold.

Low variance means the feature doesn’t change much across samples,
so it’s unlikely to be useful for classification or regression.
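
The simplest case is a feature that never changes at all. A minimal sketch on a made-up matrix (`X_toy` is an assumption for illustration), using the default threshold of 0:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up matrix: the middle column never changes
X_toy = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.9],
    [3.0, 7.0, 0.4],
    [4.0, 7.0, 0.6],
])

# With the default threshold of 0, only constant features are dropped
vt = VarianceThreshold()
X_kept = vt.fit_transform(X_toy)

print("Kept columns:", vt.get_support(indices=True))
print("Shape after selection:", X_kept.shape)
```

Variance depends on the units of each feature, so a threshold above 0 is only comparable across features that share a scale.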

In [None]:

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
print("Original shape:", X.shape)

# Apply Variance Threshold (remove features with variance <= 0.2)
selector = VarianceThreshold(threshold=0.2)
X_new = selector.fit_transform(X)

selected_indices = selector.get_support(indices=True)
variances = selector.variances_

print("Reduced shape:", X_new.shape)
print("Selected feature indices:", selected_indices)
print("Feature variances:", variances)


### ExtraTreesClassifier

In [None]:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_) 

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)


### Case Study

In [None]:

import pandas as pd
import numpy as np

my_data = pd.read_csv("data/diabetes.csv")
my_data


In [None]:

my_data.shape


In [None]:

my_data.groupby('Outcome').size()


In [None]:

# Blood pressure: the data contains 0 values for blood pressure.
# These readings are evidently wrong, because a living person
# cannot have a diastolic blood pressure of zero.

print("Total : ", my_data[my_data.BloodPressure == 0].shape[0])

print(my_data[my_data.BloodPressure == 0].groupby('Outcome')['Age'].count())


In [None]:

print("Total : ", my_data[my_data.SkinThickness == 0].shape[0])

print(my_data[my_data.SkinThickness == 0].groupby('Outcome')['Age'].count())


In [None]:

print("Total : ", my_data[my_data.Insulin == 0].shape[0])

print(my_data[my_data.Insulin == 0].groupby('Outcome')['Age'].count())


In [None]:

print("Total : ", my_data[my_data.DiabetesPedigreeFunction == 0].shape[0])

print(my_data[my_data.DiabetesPedigreeFunction == 0].groupby('Outcome')['Age'].count())


In [None]:

my_data = my_data[(my_data.BloodPressure != 0) & (my_data.BMI != 0) & (my_data.Glucose != 0)]

print(my_data.shape)


In [None]:
from sklearn.preprocessing import MinMaxScaler
from numpy import set_printoptions
array = my_data.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
print(rescaledX[0:5,:])


In [None]:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

print(fit.scores_)


In [None]:

features = fit.transform(X)

print(features[0:5,:])


The chi-squared scores for each attribute are listed below. The 4 attributes with the highest
scores are Insulin, Glucose, Age and Pregnancies.

| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|
| 106.51 | 1337.86 | 42.53 | 70.68 | 2480.04 | 94.72 | 5.75 | 181.22 |

In [None]:

from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()

model.fit(X, Y)
print(model.feature_importances_)


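The scores suggest the importance of Glucose, Age and BMI. ExtraTreesClassifier is randomized, so the exact values vary between runs.

| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|
| 0.11 | 0.24 | 0.09 | 0.08 | 0.08 | 0.14 | 0.12 | 0.14 |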

#### DecisionTreeClassifier

In [None]:

from sklearn.model_selection import train_test_split

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = my_data[feature_names]
y = my_data.Outcome

# Split the X and y defined just above (not the earlier array Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)


In [None]:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, Y_train)


In [None]:

print("Feature importances:\n{}".format(tree.feature_importances_))


In [None]:

import matplotlib.pyplot as plt
%matplotlib inline

def plot_feature_importances_diabetes(model):
    plt.figure(figsize=(8,6))
    n_features = 8
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

plot_feature_importances_diabetes(tree)
plt.show()


#### RandomForestClassifier

In [None]:

from sklearn.ensemble import RandomForestClassifier


In [None]:

rf1 = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)

rf1.fit(X_train, Y_train)


In [None]:

importances = rf1.feature_importances_
importances


In [None]:
# Label each importance with its feature name so the printout is readable
Fea_Imp = pd.Series(importances, index=feature_names)
print(Fea_Imp)