# ***Python for ML***

## Introduction

##### This is day 1 of my **2-week AI/ML Engineer learning plan**. Today I explored *NumPy* and *Pandas*, which handle array and data manipulation, along with *Matplotlib* and *Seaborn* for displaying plots. I used the following datasets to practice Pandas syntax:

##### - Pokemon dataset
##### - Titanic dataset from Kaggle

##### Agenda

##### - Comfortably manipulate arrays with NumPy
##### - Perform real data manipulation using Pandas
##### - Conduct basic Exploratory Data Analysis (EDA)
##### - Produce a clean, professional Jupyter Notebook

## Data Loading

##### In this section, the NumPy library is used to create, edit and manipulate arrays. The objectives of this section are to:

##### - Understand NumPy arrays vs Python lists
##### - Perform vectorized operations
##### - Learn broadcasting
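
##### A minimal sketch of the first two objectives, using made-up values: the same operator behaves differently on a list and on an array, and a vectorized expression replaces an explicit loop. The variable names are only illustrative.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# On a list, * repeats the sequence; on an array, it multiplies each element
list_times_two = values * 2
array_times_two = arr * 2

# A vectorized expression and the equivalent explicit Python loop
vectorized = arr ** 2 + 1
looped = [v ** 2 + 1 for v in values]

print(list_times_two)
print(array_times_two)
print(vectorized, looped)
```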

#### Step 1: Install *NumPy*

In [None]:
%pip install numpy

#### Step 2: Load *NumPy* in the notebook by importing numpy as np

In [12]:
import numpy as np

#### Step 3: Do Hands-on Tasks

##### Create 1D array

In [13]:
a = np.array([1, 2, 3])
print(a)

[1 2 3]


##### Create 2D array

In [16]:
b = np.array([[1, 2, 3],[4, 5, 6]])
print(b)

[[1 2 3]
 [4 5 6]]


##### Perform element-wise addition and multiplication (`a` is broadcast across the rows of `b`)

In [19]:
c = a+b
print(c)

[[2 4 6]
 [5 7 9]]


In [20]:
c = a*b
print(c)

[[ 1  4  9]
 [ 4 10 18]]
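
##### Note that `*` multiplies arrays element by element. For comparison, here is a short sketch of how that differs from a true matrix product, which NumPy exposes through the `@` operator. The names `elementwise` and `matvec` are only illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])

# Element-wise product: a is broadcast across each row of b
elementwise = a * b

# Matrix-vector product: each row of b is dotted with a
matvec = b @ a

print(elementwise)
print(matvec)
```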


##### Apply broadcasting manually and verify results

In [22]:
# Manually expand a's dimensions using np.newaxis (an alias for None)
a_broadcasted = a[np.newaxis, :] # Shape is now (1, 3)

# Manually create the full array by tiling (replicating) the values
a_stretched = np.tile(a_broadcasted, (2, 1)) # Shape is now (2, 3)

# Perform element-wise addition on the stretched arrays
manual_result = b + a_stretched

print("Original a:\n", a)
print("Original b:\n", b)
print("Manually broadcasted a:\n", a_stretched)
print("Manual calculation (b + a_stretched):\n", manual_result)


Original a:
 [1 2 3]
Original b:
 [[1 2 3]
 [4 5 6]]
Manually broadcasted a:
 [[1 2 3]
 [1 2 3]]
Manual calculation (b + a_stretched):
 [[2 4 6]
 [5 7 9]]
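
##### The manual result above can also be checked programmatically. This is a small sketch, assuming the same arrays `a` and `b`, that compares the tiled version with NumPy's automatic broadcasting.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])

# Manual expansion, as in the cell above
a_stretched = np.tile(a[np.newaxis, :], (2, 1))
manual_result = b + a_stretched

# Automatic broadcasting: NumPy treats a as if it had shape (2, 3)
auto_result = b + a

# np.broadcast_to returns a read-only view of a with the broadcast shape
a_view = np.broadcast_to(a, b.shape)

shapes_match = a_view.shape == a_stretched.shape
results_match = np.array_equal(manual_result, auto_result)
print(shapes_match, results_match)
```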


## Data Cleaning

##### In this section, *Pandas* is used to manipulate the "Pokemon" dataset, which comes from the YouTube tutorial titled "Pandas tutorial - Corey Schafer".

##### Agenda

##### - Load datasets
##### - Clean and transform data
##### - Prepare data for ML pipelines

#### Step 1: Install the *Pandas* library

In [None]:
%pip install pandas

#### Step 2: Load *Pandas* in the notebook by importing pandas as pd

In [None]:
import pandas as pd

#### Step 3: Do Hands-on Tasks

##### Load CSV dataset

In [None]:
df = pd.read_csv("pokemon.csv")

##### Filter rows based on conditions

In [None]:
#Filter all Grass/Poison type Pokemon with HP greater than 70
new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]

#Renumber the rows from 0 so the filtered frame has a clean index
new_df = new_df.reset_index(drop=True)

#Save without writing the index as an extra column
new_df.to_csv('filtered.csv', index=False)
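
##### To see how the combined condition works, here is a sketch on a few made-up rows that mimic the Pokemon columns. The names and values are illustrative and are not taken from the real file.

```python
import pandas as pd

# Made-up rows with the same column names as the filter above
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 1": ["Grass", "Grass", "Fire", "Grass"],
    "Type 2": ["Poison", "Poison", None, None],
    "HP": [80, 45, 78, 90],
})

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds tighter than == and >.
mask = (toy["Type 1"] == "Grass") & (toy["Type 2"] == "Poison") & (toy["HP"] > 70)

filtered = toy.loc[mask].reset_index(drop=True)
print(mask.tolist())
print(filtered)
```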

##### Fill or drop missing values

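In [23]:
# Reload the dataset so this cell does not depend on earlier cells having run
df = pd.read_csv("pokemon.csv")

# Add a "Total" column by summing the six stat columns by name
df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']

# Drop the "Total" column again
df = df.drop(columns=['Total'])

# Re-create "Total" by position: columns 4 to 9 are the six stats (HP to Speed)
df['Total'] = df.iloc[:, 4:10].sum(axis=1)

# Move "Total" (currently last) between the "Type 2" and "HP" columns
cols = list(df.columns)
df = df[cols[0:4] + [cols[-1]] + cols[4:12]]

df.head(5)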
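
##### The cell above works on the "Total" column. Filling or dropping missing values could look like the following sketch, which uses a made-up frame. The placeholder label `"None"` and the choice of the median are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in a text column and a numeric column
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Type 2": ["Poison", np.nan, "Flying"],
    "HP": [45.0, np.nan, 78.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Option 1: fill with a placeholder label (text) or the median (numbers)
filled = toy.copy()
filled["Type 2"] = filled["Type 2"].fillna("None")
filled["HP"] = filled["HP"].fillna(filled["HP"].median())

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(missing_counts)
print(filled)
print(dropped)
```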

##### Compute summary statistics

In [None]:
#Aggregate statistics using groupby() method
df = pd.read_csv('modified.csv')

#Helper column of ones so that count() has something to tally per group
df['count'] = 1

print(df.groupby(['Type 1', 'Type 2']).count()['count'])

#Compute summary statistics using describe() method
summary_all = df.describe(include='all')

print("\nAll Columns Summary:")
print(summary_all)
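
##### As an alternative to the helper `count` column, `groupby(...).size()` counts rows per group directly. The sketch below uses made-up rows. It also shows that rows with a missing "Type 2" are left out of the groups unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Made-up type combinations; two rows have no second type
toy = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Fire", "Fire", "Water"],
    "Type 2": ["Poison", "Poison", "Flying", np.nan, np.nan],
})

# size() counts rows per group without needing a helper column
pair_counts = toy.groupby(["Type 1", "Type 2"]).size()

# dropna=False keeps groups whose key is missing
pair_counts_all = toy.groupby(["Type 1", "Type 2"], dropna=False).size()

print(pair_counts)
print(pair_counts_all)
```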

## EDA (Exploratory Data Analysis)

##### This section covers Exploratory Data Analysis (EDA), which uses statistics and visualizations to understand a dataset's features, discover anomalies and correlations between values, and uncover patterns.

##### Agenda

##### - Understand dataset structure
##### - Identify patterns, outliers and distributions
##### - Build intuition before modeling

#### Hands-on Tasks

##### Analyze numerical vs categorical features

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Load train.csv from the Kaggle Titanic dataset
df = pd.read_csv("train.csv")

#Data manipulation: keep a subset of columns and rename the ID column
df = df[['PassengerId', 'Survived', 'Pclass', 'Age', 'Parch', 'Fare', 'Embarked']].copy()
df = df.rename(columns={'PassengerId': 'ID'})

#Plot the five most common fare values
df['Fare'].value_counts().head(5).plot(kind='bar')
plt.title('Five Most Common Fares')
plt.show()

#Determine the columns and their types (dtypes is an attribute, not a method)
print(df.columns)
print(df.dtypes)

#Show numerical features
print(df.describe())

#Show categorical columns
print(df.describe(include=['O', 'category']))

# General descriptive statistics for all numerical columns
print(df.select_dtypes(include=['number']).describe())

# Plotting a histogram for a specific numerical column (here, Age)
df['Age'].hist()
plt.title('Distribution of Age')
plt.show()

# Compute the correlation matrix
correlation_matrix = df.select_dtypes(include=['number']).corr()

# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Heatmap of Numerical Feature Correlations')
plt.show()

# Plotting a bar chart of value counts for a categorical column (here, Embarked)
df['Embarked'].value_counts().plot(kind='bar')
plt.title('Count of Passengers by Port of Embarkation')
plt.show()

# Calculate the mean of a numerical column (Age) grouped by a categorical column (Embarked)
print(df.groupby('Embarked')['Age'].mean())

# Get a more comprehensive statistical summary using .describe() after grouping
print(df.groupby('Embarked')['Age'].describe())
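
##### A compact way to separate numerical from categorical columns is `select_dtypes`. The sketch below uses a few made-up rows with Titanic-style column names. The values are illustrative, not taken from `train.csv`.

```python
import pandas as pd

# Made-up rows with Titanic-style columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "S"],
})

# Split the column names by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numerical feature summarized per category
age_by_port = toy.groupby("Embarked")["Age"].mean()

print(numeric_cols)
print(categorical_cols)
print(age_by_port)
```

##### Note that `Pclass` is stored as an integer, so it lands in the numerical group even though it behaves like a category (ticket class). The dtype alone does not settle how a feature should be treated.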


##### Identify potential target leakage

In [None]:
# Calculate correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

# Grouped analysis: mean of the target ('Survived') for each passenger class
print(df.groupby('Pclass')['Survived'].mean())

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("Report.html")
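
##### One simple heuristic for spotting leakage is to flag features whose correlation with the target is suspiciously close to 1. The sketch below uses made-up rows. The column `LifeboatAssigned` is hypothetical (it stands in for information recorded after the outcome) and the 0.95 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rows; LifeboatAssigned is a hypothetical column that mirrors the target
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 2],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 21.0],
    "LifeboatAssigned": [0, 1, 1, 0, 1, 0],
})

target = "Survived"
threshold = 0.95  # arbitrary cutoff for this illustration

# Correlation of every numeric feature with the target, excluding the target itself
corr_with_target = toy.corr(numeric_only=True)[target].drop(target)

# Features above the cutoff deserve a manual check of how they were collected
suspects = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(suspects)
```

##### A high correlation alone does not prove leakage. The decisive question is whether the feature would be available at prediction time.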

##### Write insights in markdown cells

##### - Gender correlates strongly with survival in this dataset.
##### - Age was also significant: younger passengers had a higher probability of survival than older passengers.
##### - Family size, which is derived from "SibSp" and "Parch", also correlates with survival. Small families had the highest survival rates.
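
##### Insights like these can be checked with grouped survival rates. The sketch below uses made-up rows, so its numbers are not results from the Titanic data. Defining family size as `SibSp + Parch + 1` (counting the passenger) is an assumed convention.

```python
import pandas as pd

# Made-up rows with Titanic-style columns; not real passenger data
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "SibSp": [1, 0, 0, 1, 4, 3],
    "Parch": [0, 0, 2, 1, 2, 2],
})

# Family size = siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# The mean of a 0/1 target is the survival rate within each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_family = toy.groupby("FamilySize")["Survived"].mean()

print(rate_by_sex)
print(rate_by_family)
```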

## Key Observations

#### - Array manipulation with *NumPy* is more efficient than using Python lists for the following reasons:

####        - *NumPy* stores array data in contiguous memory, whereas a list stores references to objects that can be scattered in memory. Storing the data adjacently makes caching more efficient.
####        - *NumPy* also consumes less memory than lists. Each array element is a fixed-size typed value (8 bytes for `int64` or `float64`), whereas each list element is a reference to a full Python object (a small `int` object alone takes about 28 bytes in 64-bit CPython).

#### - Data manipulation with *Pandas* is useful because its methods make it easy to analyze, compare and manipulate datasets. *Matplotlib* and *Seaborn* are also useful tools for visualizing the data with plots such as bar graphs, density plots and scatter plots.
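
#### The memory difference between arrays and lists noted above can be measured directly. This is a rough sketch, assuming a 64-bit CPython build. Exact list sizes vary by Python version, so only the array figures are fixed.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Array payload: n elements of a fixed size each
array_bytes = arr.nbytes

# List: the list object (its table of references) plus one int object per element
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print(arr.itemsize, array_bytes, list_bytes)
```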

## Conclusions

#### Today, the *NumPy* library was used for array manipulation in Python. 1D and 2D arrays were created using the ***np.array()*** function. Element-wise addition and multiplication were performed on arrays *a* and *b* using the + and * operators. Broadcasting was then reproduced manually: array *a*, which has a different shape from array *b*, was expanded to the shape of *b* before adding.

#### The *Pandas* library was used to manipulate data in both the Pokemon and Kaggle Titanic datasets. For the Pokemon dataset, the CSV file was loaded using the ***read_csv()*** function and rows were filtered based on conditions. Columns were dropped, recomputed and reordered to make the dataset more meaningful. Summary statistics were then computed using the ***describe()*** method.

#### Exploratory Data Analysis was carried out with the *Pandas*, *Matplotlib* and *Seaborn* libraries. The Titanic dataset from Kaggle was investigated by examining its numerical and categorical features. The correlation matrix was computed to look for potential target leakage, and visualizations such as the heatmap were used to draw insights from the dataset.