# End-to-End Data Science Project Following CRISP-DM Phases

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate generate a real-life end-to-end data science project, following the phases defined by CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment:
>
> * When writing code in the notebook, be sure to include a markdown cell with a brief title and one-line description of what you are doing
> * Anytime you are told to make changes to the code (such as correct it or update it), be sure to correct the current code vice starting a new cell
>
> It's very important you get this right. Provide an authoritative, nuanced structured prompts (prompts engineering in tones of Professional, Informative, and Educational) in the role of a professional Data Scientist, Principal Solution Architect, Python / R / SAS expert with master's or PhD degrees from the world's top 1% universities; embody the role of the most qualified subject matter experts in the areas of Data Science, Analytics, DevSecOps, Terraform, and Amazon Web Services (AWS) Cloud; Perform Deep research; do cross-reference. Do it on a step-by-step basis.

This Jupyter notebook walks through an end-to-end data science project structured around the phases defined by CRISP-DM. The 'Business Understanding' phase defines the project objectives and requirements from a business perspective. The 'Data Understanding' phase covers initial data collection and familiarization with the dataset. The 'Data Preparation' phase covers data cleaning, transformation, and feature engineering. The 'Modeling' phase applies several modeling techniques and calibrates their parameters. The 'Evaluation' phase assesses the models and reviews the steps taken to build them. Finally, the 'Deployment' phase covers integrating the chosen model into the business environment and monitoring its performance. Throughout, the notebook aims for a professional, informative, and educational tone.

## Data Understanding

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df = pd.read_csv('data.csv')

In [None]:
# Display the first few rows of the dataset
print("First few rows of the data:")
print(df.head())

In [None]:
# Display the last few rows of the dataset
print("\nLast few rows of the data:")
print(df.tail())

In [None]:
# Display the shape of the dataset
print("\nShape of the dataset:", df.shape)

In [None]:
# Display information about the dataset
print("\nInformation about the dataset:")
df.info()

In [None]:
# Display the summary statistics of the dataset
print("\nSummary statistics of the dataset:")
print(df.describe())

In [None]:
# Check for missing values in the dataset
print("\nMissing values in the dataset:")
print(df.isnull().sum())

In [None]:
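# Display the distribution of each numeric feature to understand the spread of data
# (kde=True needs numeric values, so non-numeric columns are skipped)
for column in df.select_dtypes(include='number').columns:
    plt.figure(figsize=(10, 5))
    sns.histplot(df[column], kde=True, bins=30)
    plt.title(f'Distribution of {column}')
    plt.show()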

In [None]:
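# Display pairwise correlation of the numeric columns in the dataframe
# (numeric_only=True avoids an error on non-numeric columns in pandas 2.x)
print("\nPairwise correlation of numeric columns:")
print(df.corr(numeric_only=True))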

In [None]:
# Visualizing the correlation matrix of the numeric columns using a heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Note: Depending on the data type and nature of your dataset, 
# you may require additional steps to understand your data better.
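
One such additional step is a per-column overview of data type, missing values, and number of distinct values. The following is a minimal sketch, not part of the generated notebook: the helper name `summarize_columns` and the toy columns `age` and `segment` are hypothetical stand-ins for whatever `data.csv` contains.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy data standing in for data.csv (hypothetical columns)
toy = pd.DataFrame({
    "age": [34, 51, None, 28],
    "segment": ["a", "b", "a", None],
})

summary = summarize_columns(toy)
print(summary)
```

Columns with a low `n_unique` relative to the row count are often categorical, which may inform the encoding choices in the Data Preparation phase.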

## Data Preparation

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [None]:
def preprocess_data(file):
    # Load the raw data
    data = pd.read_csv(file)

    # Report missing values per column before cleaning
    print(data.isnull().sum())

    # Fill missing numerical data with the mean of the column
    num_cols = data.select_dtypes(include=[np.number]).columns
    data[num_cols] = data[num_cols].fillna(data[num_cols].mean())

    # Drop rows that still have missing values (missing categorical data)
    data.dropna(inplace=True)

    # Apply a standard scaler to the numerical columns
    # Note: this also scales a numeric target column if one is present
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])

    # Perform one-hot encoding for categorical columns
    data = pd.get_dummies(data)

    return data

In [None]:
if __name__ == "__main__":
    processed_data = preprocess_data('raw_data.csv')
    processed_data.to_csv('cleaned_data.csv', index=False)
    print(processed_data.head())
    print(processed_data.shape)
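
The function above computes the column means and scaling statistics on the full dataset, before any train/test split, which can leak information from the test rows into training. One common alternative is to fit the same steps on the training rows only. The sketch below illustrates this with scikit-learn's `ColumnTransformer`; the column names `income` and `region` and the toy values are hypothetical, and this is one possible arrangement rather than the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for raw_data.csv (hypothetical columns)
raw = pd.DataFrame({
    "income": [40.0, 52.0, np.nan, 61.0, 48.0, 75.0],
    "region": ["north", "south", "south", "north", "east", "east"],
})

# Split first so that the test rows never influence the fitted statistics
train, test = train_test_split(raw, test_size=2, random_state=0)

preprocessor = ColumnTransformer([
    # Numeric columns: mean imputation followed by standard scaling
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["income"]),
    # Categorical columns: one-hot encoding; unseen categories become all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# Statistics are learned on the training rows and reused on the test rows
train_prepared = preprocessor.fit_transform(train)
test_prepared = preprocessor.transform(test)
print(train_prepared.shape, test_prepared.shape)
```

Because the fitted `preprocessor` is a single object, it can also be saved alongside the model so that new data is transformed exactly as the training data was.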

## Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np

In [None]:
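# Separate the features from the target; assumes the target column is named 'y'
# and that df holds numeric, complete data (e.g. the output of preprocess_data)
X = df.drop('y', axis=1)
y = df['y']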

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
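# Instantiate the candidate classifiers
# (random_state makes the tree-based models reproducible; max_iter helps the solver converge)
logistic_regression = LogisticRegression(max_iter=1000)
decision_tree = DecisionTreeClassifier(random_state=42)
random_forest = RandomForestClassifier(random_state=42)
svm = SVC()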

In [None]:
models = [logistic_regression, decision_tree, random_forest, svm]

In [None]:
# Fit each model on the training set and report its accuracy on the test set
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'Accuracy of {model.__class__.__name__}: {accuracy_score(y_test, y_pred):.3f}')
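
A single train/test split gives one accuracy figure per model, which can vary noticeably with the split. As a hedged sketch of a more stable comparison, the block below runs 5-fold cross-validation on synthetic data from `make_classification`; the data, the reduced set of two models, and the names `X_demo`, `y_demo`, and `candidates` are illustrative assumptions, not part of the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Collect the five per-fold accuracy scores for each candidate
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation next to the mean shows whether a difference between two models is larger than the fold-to-fold variation.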

## Evaluation

In [None]:
# Import necessary libraries
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
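# 'X_test' is the testing data and 'y_test' holds the actual target values from the Modeling section
# 'model' is the last model fitted in the training loop (the SVC); assign another fitted model to evaluate it instead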

In [None]:
# Use the model to make predictions on the test data
y_pred = model.predict(X_test)

In [None]:
# Calculate accuracy and per-class precision, recall, and F1
# (the models trained above are classifiers, so MAE, MSE, and R-squared do not apply)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
print(metrics.classification_report(y_test, y_pred))

In [None]:
# Plotting the actual vs predicted classes as a confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Actual vs. Predicted')
plt.show()

In [None]:
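# Prediction errors: count correct vs. misclassified test samples
# (the classification analogue of a residuals plot)
misclassified = (y_test != y_pred)
sns.countplot(x=misclassified)
plt.title('Misclassified Predictions')
plt.show()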
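
An evaluation score is easier to interpret next to a trivial baseline. The sketch below compares a classifier with scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. It uses synthetic data and illustrative names (`X_demo`, `Xtr`, `baseline`, `candidate`), so it demonstrates the idea rather than reporting results for this project.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
# Candidate: any fitted classifier; logistic regression is used as an example
candidate = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

baseline_acc = accuracy_score(yte, baseline.predict(Xte))
candidate_acc = accuracy_score(yte, candidate.predict(Xte))
print(f"Baseline accuracy:  {baseline_acc:.3f}")
print(f"Candidate accuracy: {candidate_acc:.3f}")
```

A model whose accuracy is close to the baseline has learned little beyond the class frequencies, which matters most when the classes are imbalanced.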

## Deployment

In [None]:
import joblib

In [None]:
model_file_path = 'final_model.pkl'

In [None]:
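# Persist the model evaluated above; assign the best-performing fitted model here if it differs
final_model = model
joblib.dump(final_model, model_file_path)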

In [None]:
print("The model has been saved as 'final_model.pkl'.")

In [None]:
def make_prediction(input_data):
    model = joblib.load(model_file_path)
    predictions = model.predict(input_data)
    return predictions

In [None]:
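# One sample with four feature values; the number and order of values
# must match the features the model was trained on
new_data = [[5.9, 3.0, 5.1, 1.8]]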

In [None]:
print("Predicted class: ", make_prediction(new_data))
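
A saved model only accepts inputs with the same features, in the same order, as during training. One way to guard against mismatches is to store the feature names with the model and check incoming rows before predicting. The sketch below assumes the four-value sample above corresponds to the Iris dataset, which the notebook does not confirm; the bundle layout and the helper `predict_from_bundle` are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the Iris dataset stands in for the project's training data
iris = load_iris(as_frame=True)
demo_model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# Store the model together with the feature names it was trained on
bundle = {"model": demo_model, "feature_names": list(iris.data.columns)}
bundle_path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
joblib.dump(bundle, bundle_path)


def predict_from_bundle(path, rows):
    """Load the bundle, check the feature count, and return predictions."""
    loaded = joblib.load(path)
    names = loaded["feature_names"]
    frame = pd.DataFrame(rows)
    if frame.shape[1] != len(names):
        raise ValueError(f"expected {len(names)} features, got {frame.shape[1]}")
    # Label the columns so the model receives the names it was fitted with
    frame.columns = names
    return loaded["model"].predict(frame)


demo_prediction = predict_from_bundle(bundle_path, [[5.9, 3.0, 5.1, 1.8]])
print("Predicted class:", demo_prediction)
```

Failing early with a clear error is usually preferable to letting a mis-shaped input reach `predict`, where the resulting message may be harder to trace back to the caller.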