# Data loading:

In [1]:
import pandas as pd
from sklearn.datasets import load_wine

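# Load the wine dataset as a Bunch and build the feature DataFrame by hand
data = load_wine()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
target = data.target
# Alternative: have sklearn return the features (DataFrame) and target (Series) directly
X, y = load_wine(return_X_y=True, as_frame=True)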

## pandas and excel:

In [2]:
# df_1 = pd.read_csv('carseats.csv')
# df_2 = pd.read_excel('filename.xls', sheet_name='name')
# df_3 = pd.read_excel("url_link", sheet_name="sheet_name")

# Data exploration:

In [3]:
df.describe(include='all').T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| alcohol | 178.0 | 13.000618 | 0.811827 | 11.03 | 12.3625 | 13.05 | 13.6775 | 14.83 |
| malic_acid | 178.0 | 2.336348 | 1.117146 | 0.74 | 1.6025 | 1.865 | 3.0825 | 5.8 |
| ash | 178.0 | 2.366517 | 0.274344 | 1.36 | 2.21 | 2.36 | 2.5575 | 3.23 |
| alcalinity_of_ash | 178.0 | 19.494944 | 3.339564 | 10.6 | 17.2 | 19.5 | 21.5 | 30.0 |
| magnesium | 178.0 | 99.741573 | 14.282484 | 70.0 | 88.0 | 98.0 | 107.0 | 162.0 |
| total_phenols | 178.0 | 2.295112 | 0.625851 | 0.98 | 1.7425 | 2.355 | 2.8 | 3.88 |
| flavanoids | 178.0 | 2.02927 | 0.998859 | 0.34 | 1.205 | 2.135 | 2.875 | 5.08 |
| nonflavanoid_phenols | 178.0 | 0.361854 | 0.124453 | 0.13 | 0.27 | 0.34 | 0.4375 | 0.66 |
| proanthocyanins | 178.0 | 1.590899 | 0.572359 | 0.41 | 1.25 | 1.555 | 1.95 | 3.58 |
| color_intensity | 178.0 | 5.05809 | 2.318286 | 1.28 | 3.22 | 4.69 | 6.2 | 13.0 |


In [4]:
df.shape

(178, 13)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: float64(13)

In [6]:
df.isnull().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

Non-standard missing values:

In [7]:
import numpy as np
# Treat the '?' placeholder as a missing value
df['hue'] = df['hue'].replace('?', np.nan)
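
When the placeholder is known in advance, it can also be handled at load time. A minimal sketch, assuming a hypothetical CSV in which `?` marks missing entries (the inline text and the `df_demo` / `df_late` names are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV content; '?' stands in for missing values.
raw = io.StringIO("hue,proline\n1.04,1065\n?,1050\n1.03,?\n")

# Option 1: declare the placeholder while reading, so columns stay numeric.
df_demo = pd.read_csv(raw, na_values=["?"])

# Option 2: the data is already loaded as strings. errors="coerce" turns
# every non-numeric entry (not only '?') into NaN.
df_late = pd.DataFrame({"hue": ["1.04", "?", "1.03"]})
df_late["hue"] = pd.to_numeric(df_late["hue"], errors="coerce")

print(df_demo.isnull().sum())
```

Option 1 is usually preferable because the column never becomes an `object` column in the first place.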

### Linear classifiers
- logistic classifier (target has 2 classes)
- softmax classifier (target has more than 2 classes)
- naive Bayes classifier:
    - Bernoulli (binary data)
    - multinomial (discrete data)
    - Gaussian (continuous data)

#### Optional
   - Ridge classifier (L2 penalty)
   - Lasso classifier (L1 penalty)
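
A minimal sketch of fitting some of the linear classifiers listed above on the wine data. The split parameters, `max_iter`, and the `linear_models` / `linear_scores` names are assumptions for illustration. In recent scikit-learn versions, `LogisticRegression` with its default solver fits a multinomial (softmax) model when the target has more than two classes.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

linear_models = {
    # Scaling helps the solver converge on these features.
    "logistic / softmax": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "gaussian naive bayes": GaussianNB(),
    "ridge": make_pipeline(StandardScaler(), RidgeClassifier()),
}

linear_scores = {}
for name, model in linear_models.items():
    model.fit(X_train, y_train)
    linear_scores[name] = model.score(X_test, y_test)  # test accuracy
    print(f"{name}: {linear_scores[name]:.3f}")
```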

### Non-linear classifiers
- K-nearest neighbor classifier
- Decision trees
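
A short sketch comparing the two non-linear classifiers with 5-fold cross-validation. `n_neighbors=5`, `max_depth=4`, and the variable names are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# KNN is distance-based, so it is wrapped in a pipeline with a scaler.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Trees split on thresholds and work on the raw features.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)

knn_cv = cross_val_score(knn, X, y, cv=5)
tree_cv = cross_val_score(tree, X, y, cv=5)

print(f"knn:  {knn_cv.mean():.3f}")
print(f"tree: {tree_cv.mean():.3f}")
```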

### Ensembles
- Random forests
- Voting classifiers
- Extra trees (extremely randomized trees)
- Bagging / boosting / pasting
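
A sketch of how the ensembles above map onto scikit-learn classes. The hyperparameters and the `ensembles` / `ensemble_scores` names are assumptions for illustration. Pasting is expressed here as `BaggingClassifier` with `bootstrap=False`, i.e. sampling without replacement.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

ensembles = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
    # Bagging: each tree sees a bootstrap sample (with replacement).
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
        bootstrap=True, random_state=42,
    ),
    # Pasting: same idea, but sampling without replacement.
    "pasting": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
        bootstrap=False, random_state=42,
    ),
    "boosting": GradientBoostingClassifier(random_state=42),
    # Soft voting averages the predicted class probabilities.
    "voting": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("gnb", GaussianNB()),
            ("tree", DecisionTreeClassifier(random_state=42)),
        ],
        voting="soft",
    ),
}

ensemble_scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    ensemble_scores[name] = model.score(X_test, y_test)  # test accuracy
    print(f"{name}: {ensemble_scores[name]:.3f}")
```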


#### Preprocessing steps

- Check for duplicates with `df.duplicated().sum()` and drop them with `df.drop_duplicates(inplace=True)`
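
A tiny sketch of the duplicate check on made-up data (the `Sales` values and the `df_dup` / `df_clean` names are hypothetical):

```python
import pandas as pd

# Rows 0 and 2 are identical on purpose.
df_dup = pd.DataFrame({
    "Sales": [9.5, 11.2, 9.5, 7.4],
    "ShelveLoc": ["Bad", "Good", "Bad", "Medium"],
})

# Count fully duplicated rows (the first occurrence is not counted).
n_duplicates = int(df_dup.duplicated().sum())

# Drop them and rebuild a clean 0..n-1 index.
df_clean = df_dup.drop_duplicates().reset_index(drop=True)

print(n_duplicates, df_clean.shape)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison.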

#### Drop / Impute Missing values
   - If a column is more than 60% empty, drop it.
   - Otherwise, impute the median for numerical columns and the most frequent value for categorical columns.
   - The Papandrea strategy is to drop the missing values and state that this does not affect the results.
   - If the data is balanced (roughly symmetric, without strong outliers), the mean can be used instead of the median.
   - `SimpleImputer` strategies: `'mean'`, `'median'`, `'most_frequent'`, `'constant'`
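
The rules above can be sketched as follows, on a small made-up frame (column names, values, and the `df_missing` / `df_kept` names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_missing = pd.DataFrame({
    "price": [120.0, np.nan, 95.0, 110.0, np.nan],
    "shelf": ["Good", "Bad", np.nan, "Bad", "Medium"],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],  # 80% empty
})

# Drop columns whose share of missing values exceeds 60%.
missing_fraction = df_missing.isnull().mean()
df_kept = df_missing.loc[:, missing_fraction <= 0.6].copy()

num_cols = df_kept.select_dtypes(include="number").columns
cat_cols = df_kept.select_dtypes(exclude="number").columns

# Median for numerical columns, most frequent value for categorical ones.
df_kept[num_cols] = SimpleImputer(strategy="median").fit_transform(df_kept[num_cols])
df_kept[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df_kept[cat_cols])

print(df_kept)
```

In a real workflow the imputers would be fitted on the training split only and then applied to the test split.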

#### Outliers
   - Use the quantile-based snippet, as it is the fastest to execute.
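
The quantile snippet itself is not included in these notes. One common version, shown here as an assumption rather than the original snippet, flags rows outside 1.5 times the interquartile range:

```python
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True, as_frame=True)

# Per-column quartiles and interquartile range.
q1 = X.quantile(0.25)
q3 = X.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A row is an outlier if any of its features falls outside the fences.
outlier_mask = ((X < lower) | (X > upper)).any(axis=1)
X_no_outliers = X[~outlier_mask]

print(f"removed {int(outlier_mask.sum())} of {len(X)} rows")
```

The 1.5 multiplier is a convention; a larger value removes fewer rows.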

- Categorical vs Numerical:
   - `df_numerical = df.select_dtypes(exclude=["object", "category"])`
   - `df_categorical = df.select_dtypes(include=["object", "category"])`

- The sklearn implementations require numeric input, so categorical columns must always be encoded, even though decision trees and random forests do not strictly need encoding in principle.

- Decision trees and random forests tend to perform better with label (integer) encoding. The two encoding options are:

    - **Ordinal encoding:**
        - `dictionary={"ShelveLoc": {"Bad":0, "Medium":1, "Good":2}}`
        - `df.replace(dictionary, inplace=True)`

    - **Categorical with OneHotEncoding:**
        - always assign the encoder to a variable (rather than calling it inline) so that its fitted attributes, such as `categories_`, stay accessible.

#### Scaling
   - Always fine to apply, but justify why you are using it with decision trees and random forests, as it is not strictly needed for them.
   - Logistic regression needs it because it is fitted with gradient descent, and KNN needs it because distances are sensitive to feature scale.
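
To see the effect on a scale-sensitive model, here is a sketch comparing KNN with and without `StandardScaler` on the wine data (`n_neighbors=5` and the variable names are illustrative assumptions):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

knn_raw = KNeighborsClassifier(n_neighbors=5)
# Putting the scaler in a pipeline means it is re-fitted on each training
# fold, so no information leaks from the validation fold.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(knn_raw, X, y, cv=5).mean()
scaled_acc = cross_val_score(knn_scaled, X, y, cv=5).mean()

print(f"raw:    {raw_acc:.3f}")
print(f"scaled: {scaled_acc:.3f}")
```

Without scaling, large-valued features such as `proline` dominate the distance computation.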

#### Normalizing
Information-based algorithms (decision trees, random forests) and probability-based
algorithms (naive Bayes, Bayesian networks) don't require normalization.
Normalizing helps KNN and logistic regression.
- Gaussian naive Bayes assumes that each feature is normally distributed within each class.
- Log transform for Poisson-distributed (right-skewed count) features: `df['column'] = np.log(df['column'] + 1)`, or equivalently `np.log1p(df['column'])`.
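
A small sketch of the log transform on simulated counts. The `visits` column, the Poisson rate, and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_counts = pd.DataFrame({"visits": rng.poisson(lam=3, size=1000)})

# log1p(x) computes log(1 + x), so zero counts map to 0 instead of -inf.
df_counts["visits_log"] = np.log1p(df_counts["visits"])

print(df_counts.describe().T)
```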

### Evaluation metrics
- Accuracy score
- Confusion Matrix
- Classification Report
- ROC
- ROC-AUC
- Multi-label Confusion Matrix
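- `oob_score_` and `oob_decision_function_` for random forests and other bagged ensembles fitted with `oob_score=True` (use in combination with the classification report).
    - the OOB score is an accuracy, so it is not useful for unbalanced datasets.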
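
A sketch computing the metrics above for a random forest on the wine data. The split, `n_estimators`, and the variable names are assumptions. The one-vs-rest setting for ROC-AUC is one of several multi-class options.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    multilabel_confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# oob_score=True scores each training sample using only the trees
# that did not see it during fitting.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
y_proba = forest.predict_proba(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)                    # shape (3, 3)
per_class_cm = multilabel_confusion_matrix(y_test, y_pred)  # one 2x2 matrix per class
auc = roc_auc_score(y_test, y_proba, multi_class="ovr")  # one-vs-rest ROC-AUC
oob = forest.oob_score_

print(f"accuracy: {acc:.3f}  roc-auc: {auc:.3f}  oob: {oob:.3f}")
print(cm)
print(classification_report(y_test, y_pred))
```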