__contents:__
1. [Import Library](#Import_Library)
2. [Data Collection](#data_collection)
3. [Exploration and Understanding](#Exploration-and-Understanding)
   + 3.1. [View Data Structure](#View-Data-Structure)
   + 3.2. [Descriptive Statistics](#Desceiptive_Statistics)
   + 3.3. [Specify the number of rows and columns](#Specify_the_number_of_rows_and_columns)
   + 3.4. [Specify the name of the column header](#Specify_the_name_of_the_column_header)
   + 3.5. [Rename columns](#Rename_columns)
   + 3.6. [Delete a column or a row](#Delete_a_column_or_a_row)
   + 3.7. [add a column](#add_a_column)
   + 3.8. [add a row](#add_a_row)
4. [Handling Missing Value](#Handling_Missing_Value)
    + 4.1. [identify missing value](#identify_missing_value)
        * 4.1.1. [Identify the rows that are completely empty](#Identify_the_row_that_are_completely_empty)
        * 4.1.2. [imputation or removal](#imputation_or_removal)
5. [Data Cleaning](#Data_Cleaning)
6. [Identifying outliers](#Identifying_outliers)
7. [Data Transformation](#Data_Transformation)
    + 7.1. [Scaling](#Scaling)
        - 7.1.1. [Normalization](#Normalization)
        - 7.1.2. [Standardization](#Standardlization)
        - 7.1.3. [Encoding Categorical Variable](#Encoding_Categorical_Variable)

# 1.Import Library <a id='Import_Library'></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

# 2.Data Collection <a id='data_collection'></a>

<font color='#808080'>
Obtain data from reliable sources, such as databases, websites, or CSV files.
</font>

<font color='#808080'>
1) Read the file or load the data
</font>

In [None]:
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("/content/drive/MyDrive/preprocessing data/Data.csv")

# 3.Exploration and Understanding <a id='Exploration-and-Understanding'></a>

## 3.1.View Data Structure <a id='View-Data-Structure'></a>

- ```df.head()```: <font color='#808080'> Displays the first 5 rows of the DataFrame (or the first 10 rows if ```df.head(10)``` is used). </font>

In [None]:
df.head()
df.head(10)

- ```df.info()```: <font color='#808080'> Provides a concise summary of the DataFrame, including data types and non-null values. </font>

In [None]:
df.info()

- ```df.tail()```: <font color='#808080'> Displays the last 5 rows of the DataFrame (or the last 10 rows if `df.tail(10)` is used). </font>

In [None]:
df.tail()
df.tail(10)

- ```df.sample()```: <font color='#808080'> Returns a random row from the DataFrame (or a random set of 10 rows if ```df.sample(10)``` is used). </font>

In [None]:
df.sample()
df.sample(10)

- ```df.index```: <font color='#808080'> Returns the index (row labels) of the DataFrame. </font>

In [None]:
df.index

- ```df.attrs```: <font color='#808080'> Returns a dictionary of global attributes associated with the DataFrame. </font>

In [None]:
df.attrs

- ```df.value_counts()```: <font color='#808080'> Returns the counts of unique rows in the DataFrame (or of unique values when called on a single column), sorted in descending order. </font>

In [None]:
df.value_counts()

- ```df["feature"].value_counts().idxmax()```: <font color='#808080'> Returns the most frequent value (mode) in the "feature" column.
- ``` df["feature"].value_counts().max() ```: <font color='#808080'>Returns the highest frequency (count) of the most frequent value in the "feature" column.
- ```df["feature"].value_counts().idxmin() ```: <font color='#808080'>Returns the least frequent value in the "feature" column.
- ``` df["feature"].value_counts().min()```: <font color='#808080'>Returns the lowest frequency (count) of the least frequent value in the "feature" column.
</font>

In [None]:
df["feature name"].value_counts().idxmax()
df["feature name"].value_counts().max()
df["feature name"].value_counts().idxmin()
df["feature name"].value_counts().min()

- ```df.values```: <font color='#808080'>Returns the DataFrame's data as a NumPy array, excluding the index and column labels.</font>

In [None]:
df.values

- ```df.dtypes```: <font color='#808080'>Returns a Series with the data types of each column in the DataFrame.</font>

In [None]:
df.dtypes

- ```df.axes```: <font color='#808080'>Returns a list containing the row and column index objects of the DataFrame, providing the structure of its axes.</font>

In [None]:
df.axes

- ```df.empty```: <font color='#808080'>Returns a boolean value indicating whether the DataFrame is empty (i.e., has no rows).</font>

In [None]:
df.empty

- ```df.nunique()```: <font color='#808080'>Returns the count of unique values for each column in the DataFrame.
- ```df["column_name"].nunique()```: <font color='#808080'>Returns the count of unique values in the specified column (column_name) of the DataFrame. </font>

In [None]:
df.nunique()
df["column_name"].nunique()

## 3.2.Descriptive Statistics <a id='Desceiptive_Statistics'></a>

- ```df.describe()```: <font color='#808080'>Generates descriptive statistics for the DataFrame.</font>

In [None]:
df.describe()

- ```df["column_name"].mean()```: <font color='#808080'>Returns the average (mean) of the values in the specified column (column_name).</font>
- ``` df["column_name"].mode()```: <font color='#808080'>Returns a Series containing the mode(s) of the specified column (column_name). If there are multiple modes, all of them will be returned.
- ```df["column_name"].median()```: <font color='#808080'>Returns the median (the middle value) of the specified column (column_name).
- ```df["column_name"].std()```: <font color='#808080'>Returns the standard deviation of the values in the specified column (column_name), measuring the dispersion of the data points.
- ```df["column_name"].var()```: <font color='#808080'>Returns the variance of the values in the specified column (column_name), indicating the degree of spread in the data set.
- ```df["column_name"].min()```: <font color='#808080'>Returns the smallest value in the specified column (column_name).
- ```df["column_name"].max()```: <font color='#808080'>Returns the largest value in the specified column (column_name).
- ```df["column_name"].quantile(q)```: <font color='#808080'>Returns the value at the given quantile (e.g., ```q=0.5``` for the median) for the specified column (column_name).
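
As a quick illustration of the statistics listed above, the sketch below runs them on a small made-up column. The DataFrame `demo` and its values are assumptions for illustration only and are not taken from Data.csv.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the calls listed above
demo = pd.DataFrame({"Age": [20, 30, 30, 40, 50]})

age_mean = demo["Age"].mean()        # arithmetic mean
age_mode = demo["Age"].mode()        # Series holding the most frequent value(s)
age_median = demo["Age"].median()    # middle value
age_std = demo["Age"].std()          # sample standard deviation (ddof=1 by default)
age_var = demo["Age"].var()          # sample variance (ddof=1 by default)
age_q1 = demo["Age"].quantile(0.25)  # first quartile

print(age_mean, age_median, age_mode.tolist(), age_q1, age_var)
```

Note that pandas uses the sample formulas (`ddof=1`) for `std()` and `var()` unless told otherwise.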

## 3.3.Specify the number of rows and columns <a id='Specify_the_number_of_rows_and_columns'></a>

- ```df.shape```: <font color='#808080'>Returns the number of rows and columns

In [None]:
df.shape

## 3.4.Specify the name of the column header <a id='Specify_the_name_of_the_column_header'></a>

- ```df.columns```: <font color='#808080'>Returns the column labels (header names) of the DataFrame.

In [None]:
df.columns

## 3.5.Rename columns <a id='Rename_columns'></a>

- ```df.rename(columns={'Age': 'age'}, inplace=False)```: <font color='#808080'> Returns a new DataFrame with the specified columns renamed (in this case changing 'Age' to 'age'), without modifying the original DataFrame since ```inplace=False``` is set.

In [None]:
df.rename(columns={'Age': 'age'}, inplace=False)

## 3.6.Delete a column or a row <a id='Delete_a_column_or_a_row'></a>

- ```df = df.drop("Age", axis=1)```: <font color='#808080'> Removes the column labeled "Age" from the DataFrame, with axis=1 indicating that a column is being dropped.

In [None]:
df = df.drop("Age", axis=1)

- ```df = df.drop(0, axis=0)```: <font color='#808080'> Removes the row with the index label 0 from the DataFrame, with axis=0 indicating that a row is being dropped.

In [None]:
df = df.drop(0, axis=0)

## 3.7.add a column <a id='add_a_column'></a>

- ```df["M"] = 10```: <font color='#808080'>Creates a new column named "M" in the DataFrame and assigns the value 10 to all rows in that column.
- ```df["M"] = np.nan```: <font color='#808080'>Creates a new column named "M" in the DataFrame and assigns NaN (Not a Number) to all rows in that column, effectively indicating missing or undefined values.  

In [None]:
df["M"] = 10
df["M"] = np.nan

- ```df["upper_name"] = df['Country'].apply(lambda x : x.upper())```: <font color='#808080'>Creates a new column named "upper_name" in the DataFrame by applying a lambda function that converts the values in the "Country" column to uppercase. 

In [None]:
df["upper_name"] = df['Country'].apply(lambda x : x.upper())

- ```df["new_column"] = [0,1,2,0,0,0,1,1,0]```: <font color='#808080'> Creates a new column named "new_column" in the DataFrame and assigns the provided list of values [0, 1, 2, 0, 0, 0, 1, 1, 0] to it. The length of the list must match the number of rows in the DataFrame; otherwise, it will raise an error.

In [None]:
df["new_column"] = [0,1,2,0,0,0,1,1,0]

## 3.8.add a row <a id='add_a_row'></a>

<font color='#808080'> **Adding a new row using loc[]:**

**<font color='#808080'>These methods only allow adding a row at the beginning or end of the DataFrame**
- ```df.loc[len(df)] = ["Iran", 54.0, 52000, "Yes"]```: <font color='#808080'>Adds a new row at the end of the DataFrame with the specified values, using the current length of the DataFrame as the index label. This assumes the default 0..n-1 index; if that label already exists (for example after rows have been dropped), the existing row is overwritten instead.

In [None]:
df.loc[len(df)] = ["Iran", 54.0, 52000, "Yes"]
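
One thing to watch with `df.loc[len(df)]`: it only appends when the label `len(df)` is not already in the index. The sketch below shows the difference on a made-up two-column frame (`demo`, `dropped`, and `reset` are hypothetical names, not part of the notebook's data).

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({"Country": ["France", "Spain", "Germany"],
                     "Age": [44.0, 27.0, 30.0]})

# After dropping row 0 the index labels are [1, 2] but len(dropped) is 2,
# so loc[len(dropped)] targets the existing label 2 and overwrites it.
dropped = demo.drop(0, axis=0)
dropped.loc[len(dropped)] = ["Iran", 54.0]

# Resetting the index first restores labels 0..n-1, so the row is appended.
reset = demo.drop(0, axis=0).reset_index(drop=True)
reset.loc[len(reset)] = ["Iran", 54.0]

print(dropped)
print(reset)
```

Calling `reset_index(drop=True)` after deleting rows is one simple way to keep this pattern safe.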

- ```df.loc[5:] = ["Iran", 54.0, 52000, "Yes"]```: <font color='#808080'>Does not add a row: it overwrites every existing row from index label 5 to the end of the DataFrame with the specified values, and can raise an error if the number of values does not match the number of columns.

In [None]:
df.loc[5:] = ["Iran", 54.0, 52000, "Yes"]

- ```df.loc[:] = ["Iran", 54.0, 52000, "Yes"]```: <font color='#808080'> Replaces every existing row in the DataFrame with the specified values; no row is added.

In [None]:
df.loc[:] = ["Iran", 54.0, 52000, "Yes"]

**<font color='#808080'>Adding a new row using concat()**
- ```new_data = pd.DataFrame({"Country": ["Dubai"], "Age": [39.0], "Salary": [96000.0], "Purchased": ["Yes"]})```: <font color='#808080'> Creates a new DataFrame named new_data with one row of data.
- ```df = pd.concat([df, new_data], ignore_index=True)```: <font color='#808080'> Concatenates new_data to the original DataFrame df, resetting the index with ignore_index=True to maintain sequential indexing.

In [None]:
new_data = pd.DataFrame({"Country": ["Dubai"], "Age": [39.0], "Salary": [96000.0], "Purchased": ["Yes"]})
df = pd.concat([df, new_data], ignore_index=True)

- ```new_data = pd.DataFrame({"Country":["Usa"],"Age":[45.0],"Salary":[10000.2],"Purchased":["No"]})```: <font color='#808080'> Creates another DataFrame named new_data with one row of data.
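- ```df = df._append(new_data, ignore_index=True)```: <font color='#808080'> Appends new_data to the DataFrame df, with the same result as concat(). Note that ```_append``` is a private pandas method (the public ```DataFrame.append``` was removed in pandas 2.0), so ```pd.concat``` is the supported way to do this.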

In [None]:
new_data = pd.DataFrame({"Country":["Usa"],"Age":[45.0],"Salary":[10000.2],"Purchased":["No"]})
df = df._append(new_data, ignore_index=True)

- ```new_row = pd.Series([np.nan]*len(df.columns), index=df.columns)```: <font color='#808080'>Creates a new Series filled with NaN values, matching the number of columns in the DataFrame.
- ``` df = df._append(new_row, ignore_index=True)```: <font color='#808080'> Appends the new row filled with NaN to the DataFrame df, effectively adding a row with missing values.

In [None]:
new_row = pd.Series([np.nan]*len(df.columns), index=df.columns)
df = df._append(new_row, ignore_index=True)

# 4.Handling Missing Value <a id='Handling_Missing_Value'></a>

## 4.1.identify missing value <a id='identify_missing_value'></a>

**<font color='#808080'>Identifying Missing Values in Rows**
- ```df.isna().sum(axis=1)```: <font color='#808080'>Returns a Series containing the count of missing values (NaN) for each row in the DataFrame. axis=1 indicates that the operation is performed across columns.
- ```df.isnull().sum(axis=1)```: <font color='#808080'>Functions similarly to isna(), returning a Series with the count of missing values for each row.

In [None]:
df.isna().sum(axis=1)
df.isnull().sum(axis=1)

**<font color='#808080'>Identifying Missing Values in Columns**
- ```df.isna().sum(axis=0)```:<font color='#808080'>Returns a Series containing the count of missing values (NaN) for each column in the DataFrame. axis=0 indicates that the operation is performed across rows.
- ```df.isnull().sum(axis=0)```:<font color='#808080'>Functions similarly to isna(), returning a Series with the count of missing values for each column.

In [None]:
df.isna().sum(axis=0)
df.isnull().sum(axis=0)

### 4.1.1.Identify the rows that are completely empty <a id='Identify_the_row_that_are_completely_empty'></a>

**<font color='#808080'>Identify the Number of Completely Empty Rows**
- ```df.isna().all(axis=1).sum()```:<font color='#808080'>Returns the count of rows that are completely empty (i.e., all values are NaN) by checking each row with isna(), using all(axis=1) to ensure all values in the row are NaN, and then summing the resulting boolean values.
- ```df.isnull().all(axis=1).sum()```:<font color='#808080'> Functions similarly to the above method, returning the count of completely empty rows.

In [None]:
df.isna().all(axis=1).sum()
df.isnull().all(axis=1).sum()

**<font color='#808080'>Show Indices of Completely Empty Rows**
- ```df.index[df.isna().all(axis=1)].tolist()```:<font color='#808080'>Returns a list of the indices of rows that are completely empty (all values are NaN). The boolean mask from ```isna().all(axis=1)``` selects those index labels, which are then converted to a list.

In [None]:
df.index[df.isna().all(axis=1)].tolist()

### 4.1.2.imputation or removal <a id='imputation_or_removal'></a>

**<font color='#808080'>Deleting Rows that are Completely Empty**
- ```df.dropna(how="all")```:<font color='#808080'>Returns a new DataFrame without the rows that are completely empty (i.e., all values are NaN). The how="all" parameter specifies that a row should be dropped only if all its values are NaN. The original DataFrame is unchanged unless the result is assigned back to it.

In [None]:
df.dropna(how="all")
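
The missing-value checks from section 4.1 and the `dropna(how="all")` call can be tried together on a tiny frame. This is a minimal sketch; `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is completely empty, rows 1 and 3 are partly empty
demo = pd.DataFrame({
    "Age": [44.0, np.nan, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan, np.nan],
})

missing_per_column = demo.isna().sum(axis=0)                # count per column
missing_per_row = demo.isna().sum(axis=1)                   # count per row
empty_rows = demo.index[demo.isna().all(axis=1)].tolist()   # labels of all-NaN rows

# Drops only the completely empty row; demo itself is left unchanged
cleaned = demo.dropna(how="all")

print(missing_per_column.tolist(), missing_per_row.tolist(), empty_rows, len(cleaned))
```

Rows that are only partly empty survive `how="all"`; they would need imputation or a stricter `dropna` setting.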

**<font color='#808080'>Filling NaN Values in a Column**
- ```df["Purchased"] = df["Purchased"].fillna("No")```:<font color='#808080'>Fills any NaN values in the "Purchased" column with the string "No". This replaces missing values with a specified default value.

In [None]:
df["Purchased"] = df["Purchased"].fillna("No")

**<font color='#808080'>Filling NaN Values with Mean, Mode, or Median**
- ```df["Age"] = df["Age"].fillna(df["Age"].mean())```:<font color='#808080'> Replaces NaN values in the "Age" column with the mean of that column. Similar methods can be applied using .mode() or .median() for replacing with the mode or median, respectively:
*Mode:* ```df["Age"] = df["Age"].fillna(df["Age"].mode()[0])```

*Median:* ```df["Age"] = df["Age"].fillna(df["Age"].median())``` 

In [None]:
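# Pick ONE strategy: once the first fillna has run, no NaN values are left for the later lines to fill
df["Age"] = df["Age"].fillna(df["Age"].mean())     # mean
df["Age"] = df["Age"].fillna(df["Age"].mode()[0])  # mode (the first one if there are several)
df["Age"] = df["Age"].fillna(df["Age"].median())   # median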
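
To see how the three fill strategies differ, the sketch below applies each one to its own copy of a small made-up column that contains one extreme value. The Series `age` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value (90)
age = pd.Series([20.0, 22.0, 22.0, 24.0, 90.0, np.nan], name="Age")

filled_mean = age.fillna(age.mean())       # the mean is pulled up by the 90
filled_median = age.fillna(age.median())   # the median is not affected by the 90
filled_mode = age.fillna(age.mode()[0])    # first of the most frequent values

print(filled_mean.iloc[-1], filled_median.iloc[-1], filled_mode.iloc[-1])
```

When a column has extreme values, the median or mode is often a safer fill value than the mean.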

# 5.Data Cleaning <a id='Data_Cleaning'></a>

**<font color='#808080'>Identifying Duplicated Rows**
- ```df.duplicated().sum()```:<font color='#808080'>Returns the count of duplicated rows in the DataFrame. This method marks rows as duplicated if they are identical to a previous row.

In [None]:
df.duplicated().sum()

**<font color='#808080'>Showing Duplicated Rows**
- ```df.index[df.duplicated(keep=False)].tolist()```:<font color='#808080'> Returns a list of the indices of all duplicated rows. The keep=False parameter indicates that all occurrences of duplicated rows should be included, not just the first.

In [None]:
df.index[df.duplicated(keep=False)].tolist()

**<font color='#808080'>Removing Duplicated Rows**
- ```df.drop_duplicates()```:<font color='#808080'> Removes duplicated rows from the DataFrame, keeping the first occurrence by default. This method returns a new DataFrame without the duplicates, while the original DataFrame remains unchanged unless assigned back to it. 

In [None]:
df.drop_duplicates()
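
The three duplicate-handling calls above can be compared on a small example. This is only a sketch; the frame `demo` is made up for illustration.

```python
import pandas as pd

# Hypothetical data: rows 2 and 3 repeat row 0
demo = pd.DataFrame({"Country": ["France", "Spain", "France", "France"],
                     "Age": [44, 27, 44, 44]})

# Counts only the repeats (the first occurrence is not marked)
n_duplicates = demo.duplicated().sum()

# keep=False marks every copy, including the first occurrence
all_copies = demo.index[demo.duplicated(keep=False)].tolist()

# Returns a new frame that keeps the first occurrence of each row
deduplicated = demo.drop_duplicates()

print(n_duplicates, all_copies, deduplicated.index.tolist())
```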

# 6.Identifying outliers <a id='Identifying_outliers'></a>

**<font color='#808080'> Using Z-score for Normal Distribution:**
- ```z_score = stats.zscore(df["Age"])```: <font color='#808080'> Calculates the Z-scores for the "Age" column, which measures how many standard deviations each value is from the mean. This is useful for identifying outliers in a normally distributed dataset.
- ```outliers = df[(z_score > 3) | (z_score < -3)]```: <font color='#808080'>Identifies the outliers based on the Z-scores. Rows where the absolute value of the Z-score is greater than 3 are considered outliers.

In [None]:
z_score = stats.zscore(df["Age"])
outliers = df[(z_score > 3) | (z_score < -3)]
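
The Z-score rule can also be written out by hand, which makes the formula explicit. The sketch below uses a made-up Series `ages` and the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: 19 typical values plus one extreme value
ages = pd.Series([30] * 10 + [35] * 9 + [120], name="Age")

# z = (x - mean) / standard deviation, using the population formula (ddof=0)
z = (ages - ages.mean()) / ages.std(ddof=0)

# Flag values more than 3 standard deviations away from the mean
flagged = ages[np.abs(z) > 3]

print(flagged)
```

Keep in mind that on very small samples this rule cannot fire: with ten or fewer values, no point can have an absolute Z-score above 3.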

**<font color='#808080'>Using IQR for Non-Normal Distributions**
- ```q1 = df["Age"].quantile(0.25)```:<font color='#808080'> Calculates the first quartile (25th percentile) of the "Age" column.
- ```q3 = df["Age"].quantile(0.75)```:<font color='#808080'> Calculates the third quartile (75th percentile) of the "Age" column.
- ```IQR = q3 - q1```:<font color='#808080'>Computes the Interquartile Range (IQR), which is the difference between the third and first quartiles.
- ```outliers = df[(df["Age"] < (q1 - 1.5 * IQR)) | (df["Age"] > (q3 + 1.5 * IQR))]```: <font color='#808080'>Identifies outliers using the IQR method. Values below ```q1 - 1.5 × IQR``` or above ```q3 + 1.5 × IQR``` are considered outliers.

In [None]:
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
IQR = q3 - q1
outliers = df[(df["Age"] < (q1 - 1.5 * IQR)) | (df["Age"] > (q3 + 1.5 * IQR))]
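
A worked example of the IQR rule on made-up numbers may help in checking the fences by hand. The Series `ages` below is hypothetical.

```python
import pandas as pd

# Hypothetical ages with one extreme value
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 95], name="Age")

q1 = ages.quantile(0.25)   # 27.0
q3 = ages.quantile(0.75)   # 35.0
iqr = q3 - q1              # 8.0

lower = q1 - 1.5 * iqr     # 15.0
upper = q3 + 1.5 * iqr     # 47.0

# Anything outside [lower, upper] is flagged
flagged = ages[(ages < lower) | (ages > upper)]

print(lower, upper, flagged.tolist())
```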

**<font color='#808080'>Using Boxplot**
- ```sns.boxplot(x=df["Age"])```:Creates a boxplot for the "Age" column using Seaborn, visually representing the distribution, median, quartiles, and potential outliers.

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x=df["Age"])
plt.title("Boxplot for age")
plt.xlabel("Age")
plt.show()

# 7.Data Transformation <a id='Data_Transformation'></a>

**<font color='#808080'>Identifying a Column with Two Formats**
- ```df_string = df["Age"].apply(lambda x : isinstance(x, str))```: <font color='#808080'> Checks each value in the "Age" column to determine if it is a string. The result is a Series of boolean values.

In [None]:
df_string = df["Age"].apply(lambda x : isinstance(x, str))

**<font color='#808080'>Deleting Noise**
- ```pd.to_numeric(df["Age"], errors="coerce")```:<font color='#808080'>Converts the "Age" column to numeric values. Any values that cannot be converted (e.g., strings or non-numeric entries) are replaced with NaN due to the errors="coerce" parameter, effectively cleaning the column of noise.

In [None]:
pd.to_numeric(df["Age"],  errors="coerce")
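
The two steps above (finding string entries, then coercing to numbers) can be seen on a small mixed column. This is a sketch with a made-up Series `raw`.

```python
import pandas as pd

# Hypothetical column that mixes numbers, numeric strings, and noise
raw = pd.Series([44, "27", "unknown", 38.5], name="Age")

# True where the value is stored as a string
is_string = raw.apply(lambda x: isinstance(x, str))

# Numeric strings are parsed; anything unparseable becomes NaN
cleaned = pd.to_numeric(raw, errors="coerce")

print(is_string.tolist())
print(cleaned.tolist())
```

Because `pd.to_numeric` returns a new Series, the result has to be assigned back to the column to take effect.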

**<font color='#808080'>Converting the Data Type of a Column**
- ```df["Age"] = df["Age"].astype(str)```: <font color='#808080'>Converts the "Age" column to string data type. 

In [None]:
df["Age"] = df["Age"].astype(str)

## 7.1.**Scaling**  <a id='Scaling'></a>

### 7.1.1. **Normalization** <a id='Normalization'></a>

<font color='#808080'>Normalization rescales each feature to a specific range (typically 0 to 1). The methods below are common ways to rescale data:

**<font color='#808080'>Min-Max Scaling**
- ```scaler = MinMaxScaler()```: <font color='#808080'> Initializes the MinMaxScaler.
- ```scaler.fit_transform(data)```: <font color='#808080'>Fits the scaler to the data and transforms it, scaling each feature to a range between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit_transform(data)
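
With its default range of 0 to 1, min-max scaling computes `(x - min) / (max - min)` for each column. The sketch below writes that formula out in pandas on made-up data; it is a hand-written equivalent for illustration, not scikit-learn's own implementation, and `data` here is a hypothetical frame.

```python
import pandas as pd

# Hypothetical numeric data
data = pd.DataFrame({"Age": [20.0, 30.0, 40.0],
                     "Salary": [50000.0, 60000.0, 90000.0]})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (data - data.min()) / (data.max() - data.min())

print(scaled)
```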

**<font color='#808080'>Max Abs Scaling** Scales each feature by dividing by its maximum absolute value, preserving the sign and resulting in values in the range [-1, 1].

- ```max_abs_scaler = MaxAbsScaler()```: <font color='#808080'> Initializes an instance of the MaxAbsScaler to scale features by their maximum absolute value.
- ```scaled_data = max_abs_scaler.fit_transform(data)```: <font color='#808080'> Fits the scaler to the data and transforms it, scaling each feature to the range [-1, 1] based on its maximum absolute value.

In [None]:
from sklearn.preprocessing import MaxAbsScaler
max_abs_scaler = MaxAbsScaler()
scaled_data = max_abs_scaler.fit_transform(data)
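
Max-abs scaling divides each column by its largest absolute value. A hand-written pandas version on made-up data (the frame `data` is hypothetical) looks like this:

```python
import pandas as pd

# Hypothetical data containing negative values
data = pd.DataFrame({"Change": [-4.0, 2.0, 1.0]})

# Divide by the largest absolute value in each column; signs are preserved
scaled = data / data.abs().max()

print(scaled)
```

Unlike min-max scaling, this does not shift the data, so zeros stay zero.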

**<font color='#808080'> Logarithm scaling** Transforms data by applying the natural logarithm, useful for reducing skewness and managing exponential growth, especially with positive values.

In [None]:
# Natural-log transform; "Salary" is only an example column, and all values must be positive
np.log(df["Salary"])

### 7.1.2.**Standardization** <a id='Standardlization'></a>

<font color='#808080'>Standardization of data means scaling the data so that the mean of the data is equal to zero and the standard deviation is equal to one. 

**<font color='#808080'>Advantages:**

+ <font color='#808080'> Useful for models such as logistic regression, SVM, and KNN that are sensitive to the scale of the data.
+ <font color='#808080'> Works well for data that approximately follows a normal distribution.
+ <font color='#808080'> For models trained with gradient descent, standardizing the data often speeds up convergence.

- ```scaler = StandardScaler()```:<font color='#808080'> Initializes an instance of the StandardScaler, which standardizes features by removing the mean and scaling to unit variance.
- ```scaler.fit_transform(data)```:<font color='#808080'> Fits the scaler to the provided data and transforms it, resulting in a dataset with a mean of 0 and a standard deviation of 1 for each feature.


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(data)
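
Standardization is the Z-score formula applied to every column: subtract the mean and divide by the standard deviation. The sketch below does this by hand on made-up data, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The frame `data` is hypothetical.

```python
import pandas as pd

# Hypothetical numeric data
data = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})

# (x - mean) / std with ddof=0, applied column by column
standardized = (data - data.mean()) / data.std(ddof=0)

print(standardized)
print(standardized["Age"].mean(), standardized["Age"].std(ddof=0))
```

Using the pandas default `std()` (with `ddof=1`) would give slightly different numbers on small samples.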

**<font color='#808080'> When the data contains outliers, use the following scaler instead.**
- ```scaler = RobustScaler()```:Initializes an instance of the `RobustScaler`, which scales features using statistics that are robust to outliers, specifically the median and the interquartile range (IQR), making it suitable for datasets with extreme values.

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

### 7.1.3.**Encoding Categorical Variable** <a id='Encoding_Categorical_Variable'></a>

**<font color='#808080'>One-Hot Encoding** A common method for converting categorical data into a numerical format, creating binary columns for each category. 
* <font color='#808080'> _with Pandas_:
- ```pd.get_dummies(df, columns=[",,,,"])```:<font color='#808080'> Generates one-hot encoded columns directly from a DataFrame.
* <font color='#808080'> _with sklearn_:
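- ```encoder = OneHotEncoder(sparse_output=False)```:<font color='#808080'> Initializes the OneHotEncoder so that it returns a dense array; use ```encoder.fit_transform(df[[...]])``` to encode the specified categorical features. The ```sparse_output``` parameter requires scikit-learn 1.2 or later; older versions call it ```sparse```.
- ```encoder.get_feature_names_out([...])```: <font color='#808080'>Returns the names of the generated columns, which can be used as column names when building a DataFrame from the encoded data.

In [None]:
pd.get_dummies(df, columns=[",,,,"])

from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
# fit_transform expects 2-D input, so select the column(s) with a list
encoded_data = encoder.fit_transform(df[[",,,"]])
encoder.get_feature_names_out([",,,"])
pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out([",,,"]))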
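
For a concrete picture of what one-hot encoding produces, here is a small `pd.get_dummies` example. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one categorical column
demo = pd.DataFrame({"Country": ["France", "Spain", "France"],
                     "Age": [44, 27, 30]})

# Each category in "Country" becomes its own indicator column
encoded = pd.get_dummies(demo, columns=["Country"])

print(encoded.columns.tolist())
print(encoded)
```

The original "Country" column is replaced by one indicator column per category, named with the `Country_` prefix.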

**<font color='#808080'>Label Encoding** Converts categorical data into numerical format by assigning a unique integer to each category.
Implemented using ```LabelEncoder``` from ```sklearn.preprocessing```, where ```encoder.fit_transform(df[...])``` converts the specified categorical feature.


In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit_transform(df[",,"])
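
`LabelEncoder` numbers the categories in sorted order. The sketch below shows the same mapping with pandas category codes on a made-up column, which makes the assigned integers easy to inspect. This is an analogous pandas technique, not `LabelEncoder` itself, and `countries` is a hypothetical Series.

```python
import pandas as pd

# Hypothetical categorical column
countries = pd.Series(["Spain", "France", "Germany", "France"], name="Country")

as_category = countries.astype("category")
categories = as_category.cat.categories.tolist()   # sorted unique values
codes = as_category.cat.codes                      # integer position of each value

print(categories)
print(codes.tolist())
```

The integers imply an order (France < Germany < Spain) that has no real meaning here, which is why one-hot encoding is often preferred for nominal input features.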

**<font color='#808080'>Discretization** The process of converting continuous data into discrete categories or bins.

+ <font color='#808080'> _with Pandas_:

<font color='#808080'> Use ```pd.cut(df[...], bins=..., labels=[])``` to define bins for the continuous variable and assign labels.
+ <font color='#808080'> _Scikit-learn_:

Import ```KBinsDiscretizer from sklearn.preprocessing```, configure it with the desired number of bins and encoding type, and use it to transform the continuous feature.

In [None]:
#first way with pandas
pd.cut(df[",,,"], bins= , labels=[])

#second way with sklearn
from sklearn.preprocessing import KBinsDiscretizer
KBinsDiscretizer(n_bins=  , encode= "ordinal", strategy="uniform")
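
Since the cell above leaves the bin edges and labels open, here is one possible way to fill in `pd.cut`. The bin edges, labels, and the Series `ages` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([5, 17, 18, 40, 70], name="Age")

# Bins are right-closed by default: (0, 17], (17, 64], (64, 120]
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=["child", "adult", "senior"])

print(groups.tolist())
```

There must be exactly one label per interval, so three labels need four bin edges.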