# Data mining

# Lesson 1

# Cross-Industry Standard Process for Data Mining

### **Objective:**
To learn how to apply the CRISP-DM methodology to a data analysis project, covering every step from business understanding to model creation and evaluation.

### **What we will learn:**

1. Understanding the phases of CRISP-DM (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment).
2. Practical work with data analysis in Python.
3. Using libraries for data processing, modeling and evaluation.
4. Building a model and evaluating it on real data.

### Libraries that we use:

- [Pandas](https://pandas.pydata.org/) - a library for working with tabular data, which will help us in the data preparation phase.
- [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) - for data visualization and identifying interesting patterns.
- [Scikit-learn](https://scikit-learn.org/stable/) - a machine learning library for building and evaluating models.

### Structure of the laboratory work:

#### Phase 1: Business Understanding

- We have sales data and want to predict which customers are most likely to make a purchase in the next month.
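
The business goal above can be turned into a concrete target variable. The following is a minimal sketch, not a prescribed solution: it assumes the `customer_name` and `purchase_date` columns of `sales_data.csv`, uses a few invented rows, and introduces a hypothetical cutoff date and a hypothetical label name `buys_next_month`.

```python
import pandas as pd

# Toy rows, invented for illustration only
orders = pd.DataFrame({
    "customer_name": ["Ann", "Ann", "Bob", "Cid"],
    "purchase_date": ["2024-01-05", "2024-02-10", "2024-01-20", "2024-02-25"],
})
orders["purchase_date"] = pd.to_datetime(orders["purchase_date"])

# Hypothetical cutoff: everything before it is history, the month after it is the future
cutoff = pd.Timestamp("2024-02-01")
window_end = cutoff + pd.DateOffset(months=1)

history = orders[orders["purchase_date"] < cutoff]
future = orders[(orders["purchase_date"] >= cutoff) & (orders["purchase_date"] < window_end)]

# Label each customer seen in the history: 1 if they also buy in the following month
customers = history["customer_name"].unique()
target = pd.Series(
    pd.Index(customers).isin(future["customer_name"]).astype(int),
    index=customers,
    name="buys_next_month",
)
print(target)
```

Only customers with purchases before the cutoff receive a label, because the model can only use information that is available at prediction time.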

The data is stored in **sales_data.csv**, which has the following columns:

    "order_id" - the unique identifier of the order,
    "customer_name" - the name of the customer who placed the order,
    "purchase_date" - the date when the purchase was made,
    "category" - the customer’s category, which can be "Regular", "Premium", or "VIP",
    "product_category" - the main category of the product purchased, such as "Electronics", "Clothing", "Home", "Sports", or "Toys",
    "product_subcategory" - the specific subcategory of the product purchased, such as "Phones", "TV", "Shirts", "Shoes", "Furniture", "Bikes", or "Dolls",
    "quantity" - the number of units of the product purchased,
    "price_per_unit" - the price of each unit of the product, rounded to two decimal places.

#### Phase 2: Data Understanding

**Exercise 1:** Analyze the data using descriptive statistics and visualization techniques. Determine whether there are any data quality issues, such as outliers or missing values.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('sales_data.csv')

# Description
print(data.describe())

# Checking for missing values
print(data.isnull().sum())

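# Visualization
# 'sales' is not a column of sales_data.csv; it is assumed here to be the order total
data['sales'] = data['quantity'] * data['price_per_unit']
sns.histplot(data['sales'], bins=30)
plt.show()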
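
For the outlier part of Exercise 1, one common option is the interquartile range (IQR) rule. The sketch below is only one possible approach; it uses an invented series of order totals with a planted outlier and a missing value, and the 1.5 multiplier is the conventional default rather than a requirement of this lab.

```python
import pandas as pd

# Toy order totals, invented for illustration; 950.0 is a planted outlier
sales = pd.Series([20.0, 22.5, 19.0, 25.0, 21.0, 23.5, 950.0, None], name="sales")

# Count missing values
n_missing = int(sales.isnull().sum())

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]
print(f"Missing values: {n_missing}")
print(f"Outliers:\n{outliers}")
```

Missing values are skipped by `quantile` and never satisfy the comparisons, so they are reported separately from the outliers.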


#### Phase 3: Data Preparation

**Exercise 2:** Clean and prepare the data for modeling. Remove or impute missing values and convert the data to the required format.

In [None]:
# Filling in missing values with the mean


# Conversion of categorical attributes into numerical attributes
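
A possible starting point for Exercise 2 is sketched below. It uses invented rows with the column names from `sales_data.csv`. The ordinal mapping `Regular < Premium < VIP` and the new column name `category_code` are assumptions made for illustration; the source does not state that the customer categories are ordered.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow sales_data.csv
df = pd.DataFrame({
    "category": ["Regular", "VIP", "Premium", "Regular"],
    "quantity": [2.0, None, 1.0, 3.0],
    "price_per_unit": [10.0, 250.0, None, 20.0],
})

# Fill missing numeric values with the mean of each column
numeric_cols = ["quantity", "price_per_unit"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Convert the categorical attribute to numbers with an explicit mapping
# (assumed ordering: Regular < Premium < VIP)
tier_order = {"Regular": 0, "Premium": 1, "VIP": 2}
df["category_code"] = df["category"].map(tier_order)

print(df)
```

The mean is sensitive to outliers, so the median may be a safer fill value when Exercise 1 reveals extreme values.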



#### Phase 4: Modeling

**Exercise 3:** Build a model to predict a target (e.g. a sales forecast). Use a machine learning technique that matches the target: regression (such as linear regression) for a numeric target, or classification for a categorical one.

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Remove customer_name, as it is not needed for the model

# Converting categorical data with OneHotEncoder


# Done

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


X = data.drop('sales', axis=1)
y = data['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model


# Prediction


# Evaluation
# print(f'MSE: {mean_squared_error(y_test, predictions)}')
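
The model, prediction and evaluation steps can be sketched as follows. To stay self-contained, the example fits a linear regression on synthetic data generated with NumPy (two made-up features and a known linear relationship plus noise) instead of the lab dataset, so the numbers it prints say nothing about `sales_data.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: y = 3*x1 + 2*x2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction
predictions = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, predictions)
print(f"MSE: {mse:.3f}")
print(f"Coefficients: {model.coef_}")
```

Because the generating coefficients are known here, comparing them with `model.coef_` is a quick sanity check that the pipeline is wired correctly.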


#### Phase 5: Evaluation

**Exercise 4:** Evaluate the quality of the model using metrics that fit the type of problem: mean squared error (MSE) and R² for regression, or precision and F1 score for classification.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluation


# print(f'Mean Squared Error: {mse}')
# print(f'R^2 Score: {r2}')
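
To make the metrics in Exercise 4 concrete, the sketch below computes them on small hand-written arrays (invented values, not model output from this lab). MSE and R² are calculated both with scikit-learn and directly from their definitions, and precision and F1 are shown for a toy binary classification case.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error, precision_score, r2_score

# Regression: invented true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same metrics from their definitions
mse_manual = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

# Classification: invented binary labels (1 = customer buys next month)
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 1, 1, 0, 0]

precision = precision_score(labels_true, labels_pred)
f1 = f1_score(labels_true, labels_pred)
print(f"Precision: {precision:.3f}, F1: {f1:.3f}")
```

MSE is expressed in squared units of the target, while R² is unitless, which makes R² easier to compare across datasets.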

#### Phase 6: Deployment

**Exercise 5:** Describe how you could deploy the model in a working application or present the results to business users.
Example: presenting the results as a report or as visualizations, or serving the model through an API.

In [None]:
# plot
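
One simple deployment route is to save the trained model to disk and load it inside the application that serves predictions. The sketch below assumes `joblib` is available (it is installed together with scikit-learn). The file name `sales_model.joblib`, the helper `predict_sales` and the single-feature toy model are all hypothetical, introduced only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model, invented for illustration: sales = 10 * quantity
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])
model = LinearRegression().fit(X, y)

# Persist the fitted model to a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "sales_model.joblib")
joblib.dump(model, model_path)


def predict_sales(quantity, path=model_path):
    """Load the saved model and return a prediction for one quantity value."""
    loaded = joblib.load(path)
    return float(loaded.predict(np.array([[quantity]]))[0])


result = predict_sales(5.0)
print(f"Predicted sales: {result:.2f}")
```

In an API, a function like `predict_sales` would sit behind an endpoint, and the model would normally be loaded once at startup rather than on every call.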
