## Data Analysis, Preprocessing, and Exploratory Data Analysis (EDA)

### Introduction
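This tutorial walks through data preprocessing, exploratory data analysis (EDA), feature engineering, and basic modeling in Python with pandas, seaborn, SciPy, and scikit-learn. Most code cells use placeholder column names (such as 'numerical_column' or 'feature1'); replace them with columns from your own dataset.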

### Step 1: Data Preprocessing
Data preprocessing involves cleaning and preparing data for analysis. This includes handling missing values, removing outliers, and transforming data types.

#### Checking for Missing Values

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
data_url = 'https://path-to-your-dataset.csv'
data = pd.read_csv(data_url)

# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
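
Raw counts are easier to judge as a share of all rows. The sketch below uses a small hypothetical DataFrame (the names `toy`, `age`, `city`, and `score` are illustrative and not part of the tutorial's dataset) to report the percentage of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Lisbon", None, None, "Porto"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() is the fraction of missing entries in each column
missing_pct = toy.isnull().mean() * 100

# Keep only columns with at least one missing value, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

A column that is mostly empty is often a candidate for removal, while a column with only a few gaps is usually worth imputing.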

#### Handling Missing Values
If missing values are found, they can be handled by removing the rows/columns or by replacing them with appropriate values (e.g., mean, median, mode).

In [None]:
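# Fill missing values in numeric columns with each column's mean
# (numeric_only=True skips text columns, which have no mean)
data.fillna(data.mean(numeric_only=True), inplace=True)
# Count the remaining missing values (non-numeric columns are not filled above)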
data.isnull().sum().sum()
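
The mean is only one of the replacement values mentioned above. As a minimal sketch on hypothetical data (the `income` and `segment` columns are assumptions for illustration), the median can be used for a numeric column with extreme values and the mode for a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 31.0, 500.0],  # 500 is an extreme value
    "segment": ["a", "b", "a", None, "a"],
})

# The median is less sensitive to extreme values than the mean
toy["income"] = toy["income"].fillna(toy["income"].median())

# mode() returns a Series because ties are possible; take the first entry
toy["segment"] = toy["segment"].fillna(toy["segment"].mode()[0])

print(toy)
```

Here the mean of the observed incomes would be pulled far upward by the single value of 500, whereas the median stays close to the typical rows.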

#### Removing Outliers
Outliers can be detected and removed to prevent them from skewing the analysis results.

In [None]:
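# Function to remove outliers based on z-score
from scipy.stats import zscore

def remove_outliers(df, column):
    # nan_policy='omit' keeps missing values from turning every z-score into NaN
    z_scores = zscore(df[column], nan_policy='omit')
    abs_z_scores = np.abs(z_scores)
    # Keep rows within 3 standard deviations of the mean (rows with NaN are dropped)
    filtered_entries = (abs_z_scores < 3)
    return df[filtered_entries]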

# Example of removing outliers from a numerical column
data = remove_outliers(data, 'numerical_column')
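
To see the z-score rule in action, the following sketch applies it to a hypothetical column of twenty identical readings and one extreme reading. It computes the z-score directly with pandas, using the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`; the `toy` DataFrame and `value` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: twenty typical readings and one extreme reading
toy = pd.DataFrame({"value": [10.0] * 20 + [1000.0]})

# Population z-score (ddof=0): distance from the mean in standard deviations
z = (toy["value"] - toy["value"].mean()) / toy["value"].std(ddof=0)

# Keep rows whose absolute z-score is below 3
kept = toy[np.abs(z) < 3]
print(len(toy), "->", len(kept))
```

Note that the largest possible z-score grows with the sample size, so in very small samples (fewer than ten rows) no value can reach a z-score of 3 and this rule removes nothing.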

#### Data Transformation
Ensure all columns have appropriate data types. For instance, convert numerical columns to integers or floats if needed.

In [None]:
# Convert a column to integer type (raises an error if the column still contains NaN)
data['integer_column'] = data['integer_column'].astype(int)
# Convert a column to float type
data['float_column'] = data['float_column'].astype(float)
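
Columns loaded from CSV files sometimes arrive as text with stray non-numeric entries, in which case a plain `astype` call fails. A hedged sketch of one way to handle this, using a hypothetical Series with an `"n/a"` entry, is to parse with `pd.to_numeric` and then switch to pandas' nullable integer dtype.

```python
import pandas as pd

# Hypothetical raw column read as text, with one unparseable entry
raw = pd.Series(["1", "2", "n/a", "4"], name="integer_column")

# errors="coerce" turns unparseable entries into NaN instead of raising
as_float = pd.to_numeric(raw, errors="coerce")

# "Int64" (capital I) is the nullable integer dtype, so the missing value survives
as_int = as_float.astype("Int64")
print(as_int)
```

The missing entry can then be imputed or dropped with the same techniques used for other missing values.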

### Step 2: Exploratory Data Analysis (EDA)
EDA involves analyzing the main characteristics of the data, often using visual methods. It helps in understanding the data distribution, relationships between variables, and identifying patterns.

#### Summary Statistics
Calculate basic statistical measures such as mean, median, and standard deviation for numerical columns.

In [None]:
# Summary statistics of the dataset
summary_statistics = data.describe()
summary_statistics
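
The output of `describe()` labels the median as the `50%` row and, by default, covers numeric columns only. The sketch below, on a hypothetical DataFrame with assumed column names, shows how to pick specific statistics with `agg` and how to include categorical columns.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 120.0],
    "group": ["a", "a", "b", "b", "b"],
})

# describe() covers numeric columns by default; the "50%" row is the median
numeric_summary = toy.describe()

# agg() returns exactly the statistics that are requested
selected = toy[["height", "weight"]].agg(["mean", "median", "std"])

# include="all" adds non-numeric columns (count, unique, top, freq)
full_summary = toy.describe(include="all")

print(selected)
```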

#### Data Visualization
Visualize the distribution of numerical features and the relationships between variables using various plots.

##### Histogram
Visualize the distribution of a numerical feature.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of a numerical feature
plt.figure(figsize=(10, 6))
sns.histplot(data['numerical_feature'], kde=True, bins=30)
plt.title('Distribution of Numerical Feature')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

##### Scatter Plot
Visualize the relationship between two numerical features.

In [None]:
# Scatter plot of two numerical features
plt.figure(figsize=(10, 6))
sns.scatterplot(x='feature1', y='feature2', data=data)
plt.title('Relationship between Feature1 and Feature2')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()

##### Bar Plot
Visualize the count of each category in a categorical feature.

In [None]:
# Bar plot of a categorical feature
plt.figure(figsize=(10, 6))
sns.countplot(x='categorical_feature', data=data)
plt.title('Count of Categories in Categorical Feature')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

##### Pie Chart
Visualize the proportion of each category in a categorical feature.

In [None]:
# Pie chart of a categorical feature
category_counts = data['categorical_feature'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Proportion of Categories in Categorical Feature')
plt.show()

##### Heatmap
Visualize the correlation matrix of the features in the dataset.

In [None]:
# Heatmap of the correlation matrix
plt.figure(figsize=(12, 8))
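# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
correlation_matrix = data.corr(numeric_only=True)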
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Features')
plt.show()
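
Before reading a heatmap, it helps to know what the extreme values of the correlation matrix mean. This sketch builds a hypothetical DataFrame (all column names are assumptions) in which one pair of columns is perfectly positively correlated and another pair perfectly negatively correlated.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],      # y = 2x, rises with x
    "z": [4.0, 3.0, 2.0, 1.0],      # z falls as x rises
    "label": ["a", "b", "a", "b"],  # text column, excluded from the matrix
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. A strong correlation does not by itself show that one variable causes the other.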

### Step 3: Feature Engineering
Feature engineering involves creating new features from existing ones to improve the performance of machine learning models.

#### Creating New Features

In [None]:
# Creating new features
data['Total Curricular Units Completed'] = data['Curricular units 1st sem (approved)'] + data['Curricular units 2nd sem (approved)']
data['Average Grade'] = (data['Curricular units 1st sem (grade)'] + data['Curricular units 2nd sem (grade)']) / 2
data['Units Passed Ratio'] = data['Total Curricular Units Completed'] / (data['Curricular units 1st sem (enrolled)'] + data['Curricular units 2nd sem (enrolled)'])

# Display the new features
data[['Total Curricular Units Completed', 'Average Grade', 'Units Passed Ratio']].head()

#### Handling Infinite and NaN Values
Dividing by zero enrolled units produces infinite or NaN ratios. Convert infinite values to NaN, then fill all NaN values with the column mean.

In [None]:
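# Replace infinite values with NaN. Assign the result back to the column:
# calling an inplace method on a selected column triggers a warning in recent
# pandas versions and has no effect under copy-on-write.
data['Units Passed Ratio'] = data['Units Passed Ratio'].replace([np.inf, -np.inf], np.nan)

# Fill NaN values with the mean
mean_units_passed_ratio = data['Units Passed Ratio'].mean()
data['Units Passed Ratio'] = data['Units Passed Ratio'].fillna(mean_units_passed_ratio)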

# Check the updated statistics
data[['Total Curricular Units Completed', 'Average Grade', 'Units Passed Ratio']].describe()
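
The infinite and NaN values come from division by zero in the ratio. A minimal sketch with hypothetical `approved` and `enrolled` columns (stand-ins for the curricular unit columns above) shows where each kind of value arises and how the cleanup affects them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the approved and enrolled unit counts
toy = pd.DataFrame({
    "approved": [5, 3, 0],
    "enrolled": [6, 0, 0],
})

# 5/6 is finite, 3/0 becomes inf, and 0/0 becomes NaN
raw_ratio = toy["approved"] / toy["enrolled"]

# Convert inf to NaN, then fill every NaN with the mean of the finite values
ratio = raw_ratio.replace([np.inf, -np.inf], np.nan)
ratio = ratio.fillna(ratio.mean())  # mean() skips NaN by default

print(raw_ratio.tolist())
print(ratio.tolist())
```

Whether the mean is the right fill value depends on the analysis; for rows with zero enrolled units, a fixed value such as 0 may be easier to interpret.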

### Step 4: Advanced Analysis
Perform advanced data analysis such as regression, clustering, and hypothesis testing.

#### Regression Analysis
Use linear regression to explore the relationship between independent variables and the target variable.

In [None]:
from sklearn.linear_model import LinearRegression

# Define features and target
X = data[['feature1', 'feature2', 'feature3']]
y = data['target']

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Print the coefficients
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
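
One way to build intuition for the printed coefficients is to fit data whose true coefficients are known. The sketch below is not the scikit-learn API; it solves the same ordinary least squares problem with NumPy on synthetic, noise-free data (the coefficients 1, 2, and -3 are assumptions chosen for illustration) and computes R² as a measure of fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known coefficients: y = 1 + 2*x1 - 3*x2
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted value is the intercept
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = params[0], params[1:]

# R^2: share of the variance in y explained by the fitted line
residuals = y - X_design @ params
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("Intercept:", intercept)
print("Coefficients:", coef)
print("R^2:", r_squared)
```

Each coefficient is the expected change in the target for a one-unit increase in that feature with the other features held fixed. With real, noisy data the recovered values will only approximate the underlying relationship and R² will fall below 1.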

#### Clustering
Use clustering algorithms like K-means to identify groups within the data.

In [None]:
from sklearn.cluster import KMeans

# Define features for clustering
X = data[['feature1', 'feature2']]

# Initialize and train the model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Add cluster labels to the data
data['Cluster'] = kmeans.labels_

# Visualize the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='feature1', y='feature2', hue='Cluster', data=data, palette='viridis')
plt.title('Clusters of Data')
plt.show()
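
K-means assigns points to clusters by Euclidean distance, so a feature measured on a large scale can dominate one measured on a small scale. A common precaution is to standardize the features first. The sketch below does this with pandas on hypothetical `income` and `age` columns; scikit-learn's `StandardScaler` performs the same transformation.

```python
import pandas as pd

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "income": [20000.0, 52000.0, 48000.0, 91000.0],
    "age": [23.0, 35.0, 61.0, 44.0],
})

# Standardize each column to mean 0 and standard deviation 1 (ddof=0)
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(2))
```

The standardized columns can then be passed to `KMeans` in place of the raw features.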

#### Hypothesis Testing
Use hypothesis testing to compare groups within the data.

In [None]:
from scipy.stats import ttest_ind

# Define two groups
group1 = data[data['categorical_feature'] == 'Category1']['numerical_feature']
group2 = data[data['categorical_feature'] == 'Category2']['numerical_feature']

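# Perform an independent two-sample t-test
# (the default assumes equal variances; pass equal_var=False for Welch's t-test)
t_stat, p_value = ttest_ind(group1, group2)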
print('T-statistic:', t_stat)
print('P-value:', p_value)
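
To make the t-statistic less of a black box, the sketch below computes Welch's version by hand with NumPy on two small hypothetical groups. This is the statistic that `ttest_ind(group1, group2, equal_var=False)` reports; the p-value additionally requires the t distribution, which SciPy supplies.

```python
import numpy as np

# Hypothetical measurements for two groups
group1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
group2 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sample variances use ddof=1 (divide by n - 1)
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)

# Welch's t-statistic: difference in means divided by its standard error
standard_error = np.sqrt(var1 / len(group1) + var2 / len(group2))
t_stat = (group1.mean() - group2.mean()) / standard_error

print("T-statistic:", t_stat)
```

A t-statistic far from zero means the difference in means is large relative to the variation within the groups. A small p-value (commonly below 0.05) is then taken as evidence against the hypothesis that the two group means are equal.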

### Step 5: Data Visualization
Create compelling visualizations to effectively communicate the findings.

#### Visualizing Feature Importances
Train a Random Forest classifier and plot its impurity-based feature importances. The classifier expects 'target' to hold class labels.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define features and target (keep numeric columns only, since the model cannot handle strings)
X = data.drop('target', axis=1).select_dtypes(include='number')
y = data['target']

# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importances
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

# Visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importances, y=feature_importances.index)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

### Summary
This tutorial covered the following key concepts:
- Data Preprocessing: Handling missing values, removing outliers, and data transformation.
- Exploratory Data Analysis (EDA): Summary statistics and data visualization.
- Feature Engineering: Creating new features and handling infinite/NaN values.
- Advanced Analysis: Regression, clustering, and hypothesis testing.
- Data Visualization: Plotting Random Forest feature importances to communicate findings.