# Machine Learning EDA Snippets
This notebook collects Python snippets commonly used for exploratory data analysis (EDA) in machine learning projects, grouped by task.

## 1. Data Loading and Basic Exploration
These snippets load a dataset and run a first inspection of its size, types, and missing values.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Loading a CSV file
df = pd.read_csv('your_data.csv')
df.head()  # Displays the first 5 rows

In [None]:
# Checking the shape of the dataset
df.shape  # Returns (rows, columns)

In [None]:
# Getting summary statistics of numerical columns
df.describe()

In [None]:
# Checking for missing values in the dataset
df.isnull().sum()

In [None]:
# Displaying the data types of each column
df.dtypes
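
To try the cells above without a file on disk, a small in-memory CSV can stand in for `your_data.csv`. This is a minimal sketch; the columns `age`, `city`, and `income` and their values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'your_data.csv'
csv_text = """age,city,income
25,Paris,50000
32,,64000
,London,58000
41,Paris,
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
sample_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample_df.shape            # number of rows and columns
missing_counts = sample_df.isnull().sum()   # missing values per column

print(sample_df.head())
print(missing_counts)
print(sample_df.dtypes)  # numeric columns with missing values are parsed as floats
```

Each column here has exactly one empty field, so `missing_counts` reports one missing value per column.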

## 2. Handling Missing Data
These snippets help you handle missing data effectively.

In [None]:
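# Filling missing values with the mean of the column
# (assign the result back; calling fillna(inplace=True) on a selected column may not update df)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())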

In [None]:
# Dropping rows with missing values
df.dropna(inplace=True)

In [None]:
# Filling missing categorical values with the mode (the most frequent value)
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])

In [None]:
# Forward fill to propagate the last valid observation forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
df.ffill(inplace=True)
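
The strategies above give different results on the same data. The sketch below applies each one to a tiny made-up DataFrame (the `price` and `color` columns are hypothetical) so the effects can be compared side by side.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "color": ["red", None, "red", "blue"],
})

# Mean imputation: both gaps receive the mean of the observed prices
mean_filled = toy["price"].fillna(toy["price"].mean())

# Mode imputation: the gap receives the most frequent color
mode_filled = toy["color"].fillna(toy["color"].mode()[0])

# Forward fill: each gap copies the closest earlier observed value
forward_filled = toy["price"].ffill()

# Dropping rows: only rows with no missing value in any column survive
dropped = toy.dropna()

print(mean_filled.tolist())
print(mode_filled.tolist())
print(forward_filled.tolist())
print(len(dropped))
```

Forward fill depends on row order, so it mainly makes sense for ordered data such as time series.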

## 3. Data Transformation and Feature Engineering
This section contains snippets to help with feature engineering and data transformation.

In [None]:
# Log transformation for skewed data
# log1p(x) computes log(1 + x), so zeros are allowed; values must be greater than -1
df['log_column'] = np.log1p(df['column_name'])

In [None]:
# Creating a new feature from existing features
df['new_feature'] = df['feature1'] * df['feature2']

In [None]:
# Binning continuous data into categories
# Intervals are right-inclusive: (0, 10], (10, 20], (20, 30]; values outside them become NaN
df['binned_column'] = pd.cut(df['column_name'], bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])
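
By default `pd.cut` builds intervals that are open on the left and closed on the right, so a value equal to the lowest bin edge falls outside every bin. The sketch below, using made-up values, shows that behavior and how `include_lowest=True` changes it.

```python
import pandas as pd

values = pd.Series([0, 5, 10, 15, 25, 35])
bins = [0, 10, 20, 30]
labels = ["Low", "Medium", "High"]

# Default: 0 is not inside (0, 10] and 35 is above the last edge, so both become NaN
default_bins = pd.cut(values, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 falls in "Low"
inclusive_bins = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"value": values, "default": default_bins, "inclusive": inclusive_bins}))
```

Values above the last edge still map to NaN in both cases; extending the final edge (for example with `float('inf')`) is one way to catch them.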

In [None]:
# Label encoding for categorical variables
# Codes follow the sorted order of the categories; missing values are encoded as -1
df['encoded_column'] = df['categorical_column'].astype('category').cat.codes
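
Category codes are assigned from the sorted category values, not from any meaningful ordering. The sketch below uses a hypothetical `sizes` column to show the resulting codes and how to recover the code-to-label mapping.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", None, "large", "small"])

as_category = sizes.astype("category")
codes = as_category.cat.codes  # integer code per row, -1 for missing values

# Mapping from integer code back to the original label
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Here "large" gets code 0 and "small" gets code 2 because the labels are sorted alphabetically. If the order matters to a model, an explicitly ordered categorical is a safer choice.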

## 4. Data Visualization
Visualizations are crucial for understanding data patterns.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Plotting a histogram for a numerical column (missing values are dropped first)
plt.hist(df['column_name'].dropna(), bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of column_name')
plt.show()

In [None]:
# Creating a boxplot for a numerical column
sns.boxplot(x=df['column_name'])
plt.title('Boxplot of column_name')
plt.show()

In [None]:
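# Correlation matrix heatmap
# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True)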
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Scatter plot for two numerical columns
plt.scatter(df['column1'], df['column2'])
plt.xlabel('column1')
plt.ylabel('column2')
plt.title('Scatter plot between column1 and column2')
plt.show()

## 5. Handling Outliers
Outliers can distort summary statistics and model fits, so identify them first and then decide whether removing them is appropriate.

In [None]:
# Identifying outliers using IQR
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]

In [None]:
# Removing outliers from the dataset
df = df[~((df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR)))]
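
The IQR rule flags any value more than 1.5 times the interquartile range below the first quartile or above the third. A worked example on made-up numbers (the `value` column is hypothetical) makes the bounds concrete.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 12, 11, 13, 12, 100]})

q1 = toy["value"].quantile(0.25)  # first quartile
q3 = toy["value"].quantile(0.75)  # third quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends, matching the strict < and > tests for outliers
is_inlier = toy["value"].between(lower, upper)
outliers = toy[~is_inlier]
cleaned = toy[is_inlier]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower}, {upper}]")
print(outliers)
```

With these numbers the quartiles are 11.25 and 12.75, the bounds are 9.0 and 15.0, and only the value 100 is flagged.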

## 6. Bucketing and Grouping
Grouping and bucketing are useful for summarizing data.

In [None]:
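# Bucketing ages into categories
# include_lowest=True keeps age 0 in the first bucket; ages above 100 become NaN
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 35, 60, 100], labels=['Child', 'Teen', 'Adult', 'Middle Aged', 'Senior'], include_lowest=True)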

In [None]:
# Grouping data and calculating mean
grouped_data = df.groupby('category_column')['value_column'].mean().reset_index()

In [None]:
# Pivot table for summarizing data
pivot_table = pd.pivot_table(df, values='value_column', index='category_column', columns='another_category', aggfunc='mean')
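
A group-by mean and a pivot table summarize the same data at different granularities. The sketch below uses a small made-up sales table (the `region`, `product`, and `revenue` columns are hypothetical) to show both shapes.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 200, 300, 500],
})

# One row per region with the mean revenue across its products
mean_by_region = sales.groupby("region")["revenue"].mean().reset_index()

# One row per region and one column per product, each cell holding the mean revenue
pivot = pd.pivot_table(
    sales, values="revenue", index="region", columns="product", aggfunc="mean"
)

print(mean_by_region)
print(pivot)
```

The group-by result collapses products into a single number per region, while the pivot keeps one cell for each region and product pair.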

## 7. Regular Expressions (Regex)
Regex is powerful for string manipulation and extraction.

In [None]:
import re

In [None]:
# Extracting the first run of digits from a string column (as a string; NaN if none is found)
df['extracted_values'] = df['string_column'].str.extract(r'(\d+)', expand=False)

In [None]:
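# Replacing patterns in a string column (here: removing every non-digit character)
# regex=True is required in pandas 2.x, where str.replace treats the pattern literally by default
df['cleaned_column'] = df['string_column'].str.replace(r'\D+', '', regex=True)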

In [None]:
# Checking if a column contains a specific pattern
# na=False marks missing strings as False instead of leaving NaN in the result
df['contains_pattern'] = df['string_column'].str.contains(r'^pattern.*$', na=False)
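
The three string operations above behave differently on the same input. The sketch below runs them on a few made-up strings; the pattern `^order` is only an illustrative choice.

```python
import pandas as pd

text = pd.Series(["order 123", "no digits", "id 45 and 67"])

# extract returns only the first match of the capture group, or NaN when nothing matches
first_number = text.str.extract(r"(\d+)", expand=False)

# replace with regex=True removes every run of non-digit characters
digits_only = text.str.replace(r"\D+", "", regex=True)

# contains tests each string against the pattern and returns booleans
starts_with_order = text.str.contains(r"^order", na=False)

print(first_number.tolist())
print(digits_only.tolist())
print(starts_with_order.tolist())
```

For the last string, `extract` yields "45" while `replace` yields "4567", since one keeps the first match and the other concatenates all remaining digits.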

## 8. Advanced Feature Engineering
These snippets create derived features that may improve model performance.

In [None]:
# Polynomial feature creation
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
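# get_feature_names_out requires scikit-learn >= 1.0; index=df.index keeps rows aligned with df
df_poly = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['feature1', 'feature2']), index=df.index)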
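
A degree-2 expansion of two features without a bias column consists of the two original features, their squares, and their product. The sketch below builds those five columns by hand in pandas to make the expanded output easier to interpret. It is not scikit-learn's implementation, and the toy values and column names are illustrative.

```python
import pandas as pd

X = pd.DataFrame({"feature1": [1.0, 2.0], "feature2": [3.0, 4.0]})

# Hand-built degree-2 terms: originals, squares, and the cross product
expanded = pd.DataFrame({
    "feature1": X["feature1"],
    "feature2": X["feature2"],
    "feature1^2": X["feature1"] ** 2,
    "feature1 feature2": X["feature1"] * X["feature2"],
    "feature2^2": X["feature2"] ** 2,
})

print(expanded)
```

The number of generated columns grows quickly with the degree and the number of input features, so it is worth checking the expanded shape before fitting a model.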

In [None]:
# Interaction features
df['interaction_feature'] = df['feature1'] * df['feature2']

In [None]:
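# Time-based features from datetime
# The .dt accessor requires a datetime dtype, so convert the column first
df['date_column'] = pd.to_datetime(df['date_column'])
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day
df['weekday'] = df['date_column'].dt.weekday  # Monday=0, Sunday=6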
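
Date columns read from CSV files usually arrive as strings, and the `.dt` accessor only works on datetime values. The sketch below converts two made-up date strings and derives the same calendar features.

```python
import pandas as pd

dates = pd.DataFrame({"date_column": ["2024-01-15", "2024-02-29"]})

# Parse the strings into datetime values before using .dt
dates["date_column"] = pd.to_datetime(dates["date_column"])

dates["year"] = dates["date_column"].dt.year
dates["month"] = dates["date_column"].dt.month
dates["day"] = dates["date_column"].dt.day
dates["weekday"] = dates["date_column"].dt.weekday  # Monday=0, Sunday=6

print(dates)
```

15 January 2024 was a Monday and 29 February 2024 a Thursday, so the `weekday` values are 0 and 3.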