# 1. [Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))

## 1.1. [Check if there is a duplicated row](#Check-if-there-is-a-duplicated-row)
### 1.1.1. [duplicated()](#duplicated())
### 1.1.2. [drop_duplicates()](#drop_duplicates())


In [10]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# suppress the warnings that show up in the outputs;
from warnings import filterwarnings
filterwarnings("ignore")

# how many rows and columns to display for a dataframe;
#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# widen the Jupyter notebook cells
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# our example dataset;
df = sns.load_dataset('iris')
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


<a id='Exploratory-Data-Analysis-(EDA)'></a>
# 1. Exploratory Data Analysis (EDA)

<a id='Check-if-there-is-a-duplicated-row'></a>
## 1.1. Check if there is a duplicated row

* Spot the duplicated observations in the dataset and discard them.
* Duplicated rows do not contribute any new information, so we do not need them.
* The **duplicated()** and **drop_duplicates()** functions are used for this.

<a id='duplicated()'></a>
### 1.1.1. duplicated()

1. Returns a boolean Series that is True for every row that duplicates an earlier row.


2. Combined with the **any()** function, it shows at once whether a dataframe or the selected features contain any duplicate.


3. Can check a single column, multiple columns, or the whole dataframe through its **subset=["column_name"]** parameter.


4. By default (**keep="first"**) the first occurrence is not flagged and only the later repeats are marked True. To show all the duplicated rows use **keep=False**.

In [11]:
# 1
df = sns.load_dataset('iris')
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Length: 150, dtype: bool

In [12]:
# 2
df = sns.load_dataset('iris')
df.duplicated().any()

True

In [13]:
# 3
df = sns.load_dataset('iris')
df.duplicated(subset=["sepal_width", "sepal_length"])

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148    False
149     True
Length: 150, dtype: bool

In [20]:
# 4
df = sns.load_dataset('iris')
df[df.duplicated(keep=False)]

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
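
The effect of the **keep=** parameter is easier to see on a tiny frame. The sketch below uses a hypothetical toy dataframe (not the iris data) in which rows 0, 2 and 3 are identical, and compares the three accepted values of `keep`.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({"a": [1, 2, 1, 1], "b": ["x", "y", "x", "x"]})

first = toy.duplicated()                # keep="first" is the default
last = toy.duplicated(keep="last")      # flag everything except the last occurrence
all_dups = toy.duplicated(keep=False)   # flag every member of a duplicate group

print(first.tolist())     # [False, False, True, True]
print(last.tolist())      # [True, False, True, False]
print(all_dups.tolist())  # [True, False, True, True]

# Summing the default mask gives the number of rows drop_duplicates() would remove
n_removed = int(first.sum())
print(n_removed)          # 2
```

Summing the mask is a quick way to report how many rows are redundant before deciding whether to drop them.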


<a id='drop_duplicates()'></a>
### 1.1.2. drop_duplicates()

1. Discards duplicated rows and keeps one of them (the first occurrence by default) in the dataframe.


2. **inplace=True** ; applies the change to the dataframe itself instead of returning a new one.


3. **ignore_index=True** ; ignores the index labels of the original rows and renumbers the remaining rows from 0.

In [21]:
df = sns.load_dataset('iris')
df.drop_duplicates(inplace=True, ignore_index=True)
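
As a small sketch of how these parameters interact, the example below runs `drop_duplicates()` on a hypothetical toy dataframe, once on whole rows and once restricted to a single column with `subset=` and `keep="last"`. The data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0; column "a" repeats in rows 0, 2, 3
toy = pd.DataFrame({"a": [1, 2, 1, 1], "b": ["x", "y", "x", "z"]})

# Whole-row comparison: only row 2 is dropped, index is renumbered 0..2
deduped = toy.drop_duplicates(ignore_index=True)

# Compare on column "a" only and keep the last occurrence of each value
deduped_a = toy.drop_duplicates(subset=["a"], keep="last", ignore_index=True)

print(len(toy), len(deduped), len(deduped_a))  # 4 3 2
print(deduped.index.tolist())                  # [0, 1, 2]
print(deduped_a["b"].tolist())                 # ['y', 'z']
```

Without `inplace=True` the original `toy` frame is left untouched, which makes it easy to compare row counts before and after.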

## 1.2. Have an initial inspection on the dataset

Check the basic properties of your dataset to get a first impression of it.

1. Size of your dataset >>> **df.shape**
2. Variable types >>> **df.info()**
3. Descriptive statistics >>> **df.describe()**
4. Get frequency of classes or values in each feature >>> **value_counts()**
5. Get unique classes or values in each feature >>> **unique()**
6. Get how many unique values each feature has >>> **nunique()**

### 1.2.1. df.shape

In [33]:
df = sns.load_dataset('iris')
df.shape

(150, 5)

### 1.2.2. df.info()

In [35]:
df = sns.load_dataset('iris')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


### 1.2.3. df.describe()

* By default, describe() only shows statistics for numerical variables. Use **include="all"** to get statistics for categorical variables as well.
* It can also return statistics separately for each class of a feature/variable when combined with **groupby()**.

In [38]:
df = sns.load_dataset('iris')
df.describe(include="all").T

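| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sepal_length | 150.0 | | | | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| sepal_width | 150.0 | | | | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| petal_length | 150.0 | | | | 3.758 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal_width | 150.0 | | | | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |
| species | 150.0 | 3.0 | setosa | 50.0 | | | | | | | |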


In [39]:
df = sns.load_dataset('iris')
df.groupby("species")["petal_length"].describe().T

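| species | setosa | versicolor | virginica |
|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 |
| mean | 1.462 | 4.26 | 5.552 |
| std | 0.173664 | 0.469911 | 0.551895 |
| min | 1.0 | 3.0 | 4.5 |
| 25% | 1.4 | 4.0 | 5.1 |
| 50% | 1.5 | 4.35 | 5.55 |
| 75% | 1.575 | 4.6 | 5.875 |
| max | 1.9 | 5.1 | 6.9 |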


### 1.2.4. value_counts()

* Useful parameters of this function;

1. **normalize=** >>> default is False; if True, it returns the frequency of each class as a proportion (a fraction between 0 and 1).
2. **ascending=** >>> default is False (most frequent first); if True, it returns the frequencies in ascending order.
3. **dropna=** >>> default is True (NaN values are excluded); if False, it also includes the frequency of NaN values.

In [41]:
df = sns.load_dataset('iris')
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

In [40]:
df.species.value_counts(normalize=True)

setosa        0.333333
versicolor    0.333333
virginica     0.333333
Name: species, dtype: float64
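
The iris data has no missing values, so **dropna=** and **ascending=** have no visible effect there. The sketch below uses a hypothetical Series with two NaN entries to show what each parameter changes.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: 3x "a", 2x "b", 1x "c" and 2 missing values
s = pd.Series(["a", "b", "a", np.nan, "a", "b", np.nan, "c"])

default_counts = s.value_counts()            # NaN excluded, most frequent first
with_nan = s.value_counts(dropna=False)      # NaN gets its own row
rare_first = s.value_counts(ascending=True)  # least frequent first
shares = s.value_counts(normalize=True)      # fractions of the non-NaN values

print(default_counts.tolist())    # [3, 2, 1]
print(int(with_nan.sum()))        # 8, every row is counted
print(rare_first.index.tolist())  # ['c', 'b', 'a']
print(shares["a"])                # 0.5
```

Note that with `normalize=True` the fractions are computed over the non-missing values only, unless `dropna=False` is passed as well.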

### 1.2.5. unique()

In [42]:
df = sns.load_dataset('iris')
df.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

### 1.2.6. nunique()

In [43]:
df = sns.load_dataset('iris')
df.species.nunique()

3
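
`nunique()` can also be called on a whole dataframe to get the count for every feature in one step. A minimal sketch on a hypothetical toy frame (the `row_id` column is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with a categorical, a numeric and an id-like column
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_width": [0.2, 0.2, 1.9],
    "row_id": [1, 2, 3],
})

# One unique-value count per column
per_column = toy.nunique()
print(per_column.to_dict())  # {'species': 2, 'petal_width': 2, 'row_id': 3}
```

A column whose count equals the number of rows (like `row_id` here) is often an identifier rather than a useful feature.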

## 1.4. Split the data to train/val(test) in an appropriate way

1. It's a necessary step to avoid data leakage and to be able to detect overfitting.


2. You should not leak any information from your train set to your test set or from your test set to your train set. Data leakage biases the predictions, which is not desired in machine learning.


3. You should decide what proportions to use for your train and test sets. Note that the train set should normally not be smaller than 50% of the data. ***Try and observe the performance of different train/test proportions while creating a machine learning model.***


4. You should also choose a proper data splitting method to create a powerful model. There are 2 options: **"Regular (Normal) splitting"** and **"Stratified splitting"**.


5. Regular splitting: can be used for both regression and classification problems.


6. Stratified splitting: can be used in classification problems, especially if the target feature is imbalanced. **It keeps the class proportions of the target feature the same in both the train and test sets.**


7. ***Note: check the distributions of all the features after splitting. This can be especially tricky with categorical features, when a value exists in one of the data groups but not in the other.***



*Helpful source:*
* https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
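
The check described in point 7 can be sketched with plain pandas. The example assumes two already split frames named `train` and `test` with a made-up categorical feature `color`. Both the frames and the column are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical train/test frames; "color" is a made-up categorical feature
train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
test = pd.DataFrame({"color": ["blue", "yellow"]})

train_values = set(train["color"].unique())
test_values = set(test["color"].unique())

only_in_test = test_values - train_values   # values the model never saw in training
only_in_train = train_values - test_values  # values that are never evaluated

print(sorted(only_in_test))   # ['yellow']
print(sorted(only_in_train))  # ['green', 'red']
```

A non-empty `only_in_test` set is usually the more serious case, because an encoder or model fitted on the train set has never seen those values.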

### 1.4.1. Regular(Normal) Splitting;

In [22]:
# the library and the function that we use;
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')
df.head(3)

In [28]:
# "species" is our target feature

# I would like to have 60% of the dataset as the train set

# shuffle=True;
# shuffles all the observations before splitting, which is necessary to make the split random

# random_state;
# The function splits the data randomly,
# so it's good to set a random_state value to get identical splitting results every time you run it.

X_train, X_test, y_train, y_test = train_test_split(df.drop(["species"], axis=1), 
                                                    df.species,
                                                    test_size=0.40,
                                                    shuffle=True,
                                                    random_state=22)

# Because shuffle is turned on, the indexes will be in mixed order. To put them back in order, run the following;
for i in [X_train, X_test, y_train, y_test]:
    i.reset_index(inplace=True, drop=True)

### 1.4.2. Stratified Splitting;

* This time we additionally use the **stratify=** parameter.

In [30]:
# the library and the function that we use;
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')
df.head(3)

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


In [31]:
# This time we additionally use the stratify= parameter.
X_train, X_test, y_train, y_test = train_test_split(df.drop(["species"], axis=1), 
                                                    df.species,
                                                    test_size=0.40,
                                                    shuffle=True,
                                                    random_state=22,
                                                    stratify= df.species)

# Because shuffle is turned on, the indexes will be in mixed order. To put them back in order, run the following;
for i in [X_train, X_test, y_train, y_test]:
    i.reset_index(inplace=True, drop=True)
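
The iris target is perfectly balanced, so the benefit of **stratify=** is hard to see there. The sketch below uses a hypothetical imbalanced target (80 "neg" and 20 "pos" rows, names made up for illustration) and checks that the class shares are preserved on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 "neg" and 20 "pos" observations
y = pd.Series(["neg"] * 80 + ["pos"] * 20, name="label")
X = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.40, shuffle=True, random_state=22, stratify=y
)

# Class shares in each part; with stratify they should match the full data
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)

print(len(y_tr), len(y_te))  # 60 40
print(train_share.round(2).to_dict())
print(test_share.round(2).to_dict())
```

Running the same split without `stratify=y` and comparing the two outputs is a simple way to see how far a regular split can drift from the original class balance.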