# 1. Data Preparation
As a first step, I load all the modules that will be used in this notebook.

In [2]:
# Install the missing wordcloud module
%pip install wordcloud

Note: you may need to restart the kernel to use updated packages.



In [3]:
# Importing essential libraries for data analysis, visualization, and machine learning
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical computations
import matplotlib as mpl  # For creating static visualizations
import matplotlib.pyplot as plt  # For plotting graphs
import seaborn as sns  # For creating attractive statistical plots
import datetime  # For handling date and time data
import nltk  # For natural language processing (if analyzing product descriptions)
import warnings  # To suppress unnecessary warnings
import matplotlib.cm as cm  # For color maps in visualizations
import itertools  # For efficient looping
from pathlib import Path  # For handling file paths
from sklearn.preprocessing import StandardScaler  # For scaling data
from sklearn.cluster import KMeans  # For K-means clustering
from sklearn.metrics import silhouette_samples, silhouette_score  # For evaluating clusters
from sklearn import preprocessing, model_selection, metrics, feature_selection  # For ML tasks
from sklearn.model_selection import GridSearchCV, learning_curve  # For hyperparameter tuning
from sklearn.svm import SVC  # For Support Vector Machine classification
from sklearn.metrics import confusion_matrix  # For evaluating classification performance
from sklearn import neighbors, linear_model, svm, tree, ensemble  # For various ML models
from wordcloud import WordCloud, STOPWORDS  # For creating word clouds (if analyzing text data)
from sklearn.ensemble import AdaBoostClassifier  # For AdaBoost classification
from sklearn.decomposition import PCA  # For dimensionality reduction
from IPython.display import display, HTML  # For displaying HTML content in Jupyter notebooks
import plotly.graph_objs as go  # For interactive visualizations
from plotly.offline import init_notebook_mode, iplot  # For rendering Plotly graphs in notebooks

# Initialize Plotly for interactive visualizations
init_notebook_mode(connected=True)

# Suppress warnings to keep the notebook clean
warnings.filterwarnings("ignore")

# Customizing matplotlib and seaborn for better visualizations
plt.rcParams["patch.force_edgecolor"] = True  # Ensure edges are visible in plots
plt.style.use('fivethirtyeight')  # Use a stylish and professional theme
mpl.rc('patch', edgecolor='dimgray', linewidth=1)  # Customize edge colors and line widths

# Enable inline plotting for matplotlib
%matplotlib inline

Then, I load the data. Once done, I also give some basic information on the content of the dataframe: the types of the variables, the number of null values, and their percentage with respect to the total number of entries:


In [4]:
# Load the dataset
#__________________
# Read the datafile with specified encoding and data types
df_initial = pd.read_csv('dataset/data.csv', encoding="ISO-8859-1",
                         dtype={'CustomerID': str, 'InvoiceNo': str})  # the column is named 'InvoiceNo', not 'InvoiceID'

# Print the dimensions of the dataframe
print('Dataframe dimensions:', df_initial.shape)

# Convert 'InvoiceDate' to datetime format for easier time-based analysis
df_initial['InvoiceDate'] = pd.to_datetime(df_initial['InvoiceDate'])

#____________________________________________________________
# Create a summary table to display column types, null values, and their percentages
tab_info = pd.DataFrame(df_initial.dtypes).T.rename(index={0: 'column type'})  # Column types
tab_info = pd.concat([tab_info, 
                     pd.DataFrame(df_initial.isnull().sum()).T.rename(index={0: 'null values (nb)'}),  # Number of null values
                     pd.DataFrame(df_initial.isnull().sum() / df_initial.shape[0] * 100).T.rename(index={0: 'null values (%)'})])  # Percentage of null values

# Display the summary table
display(tab_info)

#__________________
# Show the first few rows of the dataframe to get a sense of the data
display(df_initial.head())

Dataframe dimensions: (541909, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| column type | object | object | object | int64 | datetime64[ns] | float64 | object | object |
| null values (nb) | 0 | 0 | 1454 | 0 | 0 | 0 | 135080 | 0 |
| null values (%) | 0.0 | 0.0 | 0.268311 | 0.0 | 0.0 | 0.0 | 24.926694 | 0.0 |


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
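
The summary table shows that roughly 25% of the entries have no CustomerID. One way to remove such rows is sketched below on a toy dataframe. This is only a sketch: the `df_toy` and `df_clean` names and the sample values are illustrative assumptions, not part of the dataset or of the cells shown here.

```python
import pandas as pd

# Toy stand-in for df_initial (hypothetical values, for illustration only)
df_toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "Description": ["WHITE METAL LANTERN", None, "HAND WARMER", "DOORMAT"],
    "CustomerID": ["17850", None, None, "13047"],
})

# Keep only the rows that are assigned to a customer
df_clean = df_toy.dropna(axis=0, subset=["CustomerID"])

print("Rows before:", len(df_toy), "| rows after:", len(df_clean))
print(df_clean.isnull().sum())  # remaining null count per column
```

Applied to the real dataframe, the same call would be `df_initial.dropna(axis=0, subset=['CustomerID'])`; whether any nulls remain in the other columns afterwards should be checked with `isnull().sum()` rather than assumed.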


The summary table shows that about 25% of the entries have no CustomerID, and about 0.27% have no Description. By removing these entries, we end up with a dataframe filled at 100% for all variables. Finally, I check for duplicate entries and delete them:

In [5]:
print('Duplicate Entries: {}'.format(df_initial.duplicated().sum()))
df_initial.drop_duplicates(inplace = True)

Duplicate Entries: 5268
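
To make the behavior of `duplicated()` and `drop_duplicates()` concrete, here is a minimal sketch on a toy dataframe (the `df_toy` name and its values are hypothetical). With the default `keep='first'`, only the second and later copies of a fully identical row are counted and removed.

```python
import pandas as pd

# Toy dataframe: rows 0 and 1 are identical across all columns (hypothetical values)
df_toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["71053", "71053", "84406B", "71053"],
    "Quantity": [6, 6, 8, 6],
})

# duplicated() marks every repeat of an earlier row (keep='first' is the default)
dup_mask = df_toy.duplicated()
n_dup = int(dup_mask.sum())

# drop_duplicates() returns a copy without the marked rows
df_dedup = df_toy.drop_duplicates()

print("Duplicate entries:", n_dup)
print("Rows remaining:", len(df_dedup))
```

Row 3 is kept because it differs in `InvoiceNo`, even though its other columns match row 0: a row counts as a duplicate only when every column is equal.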


# 2. Exploring the content of variables

This dataframe contains 8 variables that correspond to:

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
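
The InvoiceNo convention described above (a leading 'c' marks a cancellation) can be turned into a boolean flag. The sketch below uses a toy dataframe; the `df_toy` name, the `is_cancellation` column, and the sample values are hypothetical, and the comparison is made case-insensitive as an assumption, since the prefix may appear in either case.

```python
import pandas as pd

# Toy invoices: two of them carry the cancellation prefix (hypothetical values)
df_toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "c536383", "536367"],
    "Quantity": [6, -1, -2, 3],
})

# Flag invoice numbers that start with 'c' or 'C'
is_cancel = df_toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")
df_toy["is_cancellation"] = is_cancel

print(df_toy)
print("Cancelled invoices:", int(is_cancel.sum()))
```

Reading InvoiceNo as a string when loading the file is what makes this check possible, since a code with a letter prefix cannot be parsed as an integer.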
---
## 2.1 Countries