# IDS Assignment 2
Document your results as well as the way you obtained them in this Jupyter notebook. A separate report (PDF, Word, etc.) is _not_ required. However, it is necessary that you provide the Python code leading to your results as well as textual answers to the assignment questions in this notebook. 

Do not change the general structure of this notebook, but you can add further markdown or code cells to explain your solutions if necessary. In the end, submit this file in Moodle.

# Preprocessing and Data Quality 


### Question 1 (Order cancellations)
Invoices with an InvoiceNo starting with the letter ‘c’ are order cancellations. Would you recommend keeping the order cancellations in your data set? Also provide a reason for your recommendation. 

Your answer: No, we would remove the order cancellations, for two reasons. First, the following exercises ('clustering' and 'association rules') do not need the information that an order was cancelled. Second, keeping the cancellations would distort the results of both parts: a cancelled order does not represent a purchase that actually took place, so it should not count as a transaction.

In [1]:
#Modify the data set according to your recommendation
import pandas as pd
import numpy as np

In [2]:
data = pd.read_excel('Assignment2 Datasets/Online Retail.xlsx')

In [3]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
data.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [5]:
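# InvoiceNo has dtype object; cast to str so the .str accessor works on every row
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
# Cancellations are the invoices whose number starts with 'C' (checked case-insensitively)
mask = data['InvoiceNo'].str.upper().str.startswith('C')
new_data = data[~mask]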

In [6]:
# Number of rows of the original data set and after removing rows with 'C'
print(data.shape)
print(new_data.shape)

(541909, 8)
(532621, 8)


### Question 2 (Empty values)
The attributes Description and CustomerID contain empty values. The Country attribute contains an “unspecified” value. For each of the three attributes reason how you would handle these values and why. 

Your answer: 

Description: Keep the rows with an empty description, because the StockCode still identifies the item. Also, only 1454 rows have an empty description.

CustomerID: We don't need the CustomerID in the following exercises, so we don't have to remove the rows with an empty CustomerID. Also, 134697 rows have an empty CustomerID; removing all of them would make the data set noticeably smaller and would give different results than the original data set.

Country: We need the Country in the 'association rules' part, so we remove the rows with an 'Unspecified' country. Only 446 rows are affected, so removing them doesn't make the data set noticeably smaller.

In [7]:
pd.isnull(new_data['UnitPrice']).sum()

0

In [8]:
#Modify the data set according to your recommendation

new_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [9]:
print('Empty Description:', pd.isnull(new_data['Description']).sum())
print('Empty CustomerID:', pd.isnull(new_data['CustomerID']).sum())
print('Unspecified Country:', new_data[new_data['Country'] == 'Unspecified'].shape[0])

Empty Description: 1454
Empty CustomerID: 134697
Unspecified Country: 446


In [10]:
print(new_data.shape)
new_data2 = new_data[new_data['Country'] != 'Unspecified']
print(new_data2.shape)

(532621, 8)
(532175, 8)


### Question 3 (Outliers/Noise)
Explore the attributes Quantity and UnitPrice by plotting each attribute visually. Do these attributes contain noise and/or outliers? If so, reason how you would handle them and modify your data set accordingly.


In [11]:
# First we remove all rows with a non-positive Quantity or a non-positive UnitPrice.
import matplotlib.pyplot as plt
new_data3 = new_data2[new_data2['Quantity'] > 0]
new_data4 = new_data3[new_data3['UnitPrice'] > 0]

boxplot = plt.boxplot(new_data4['Quantity'], labels=['Quantity'], showmeans=True, meanline=True)
plt.show()
whiskers = [item.get_ydata()[1] for item in boxplot['whiskers']]


boxplot2 = plt.boxplot(new_data4['UnitPrice'], labels=['UnitPrice'], showmeans=True, meanline=True)
plt.show()
whiskers2 = [item.get_ydata()[1] for item in boxplot2['whiskers']]

print(whiskers)

cleanedData = new_data4[(new_data4['Quantity'] >= whiskers[0]) & (new_data4['Quantity'] <= whiskers[1])]
cleanedData2 = cleanedData[(cleanedData['UnitPrice'] >= whiskers2[0]) & (cleanedData['UnitPrice'] <= whiskers2[1])]

cleanedData2.describe()

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

[1.0, 23.0]


Unnamed: 0,Quantity,UnitPrice,CustomerID
count,435801.0,435801.0,320822.0
mean,4.938421,2.699396,15353.239117
std,4.508542,1.92738,1703.201674
min,1.0,0.001,12347.0
25%,1.0,1.25,14049.0
50%,3.0,2.08,15298.0
75%,8.0,3.75,16873.0
max,23.0,8.33,18287.0


In [12]:
from sklearn import preprocessing

unit_price_array = np.array(cleanedData2['UnitPrice']).reshape(-1,1)

discretizer = preprocessing.KBinsDiscretizer(n_bins=21, encode='ordinal', strategy = 'uniform')
discretizer.fit(unit_price_array)

discretized_unit_price = discretizer.transform(unit_price_array).reshape(1,-1)

#showing the transformed data
print((discretized_unit_price+1)*0.4 - 0.2)
new_unit_price = (discretized_unit_price+1)*0.4 - 0.2
new_unit_price = new_unit_price.reshape(-1)
#displaying the edges of each bin per attribute
print(discretizer.bin_edges_)

[[2.6 3.4 2.6 ... 4.2 4.2 5. ]]
[array([1.00000000e-03, 3.97619048e-01, 7.94238095e-01, 1.19085714e+00,
       1.58747619e+00, 1.98409524e+00, 2.38071429e+00, 2.77733333e+00,
       3.17395238e+00, 3.57057143e+00, 3.96719048e+00, 4.36380952e+00,
       4.76042857e+00, 5.15704762e+00, 5.55366667e+00, 5.95028571e+00,
       6.34690476e+00, 6.74352381e+00, 7.14014286e+00, 7.53676190e+00,
       7.93338095e+00, 8.33000000e+00])]


In [13]:
cleanedData2

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2010-12-01 08:26:00,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
10,536367,22745,POPPY'S PLAYHOUSE BEDROOM,6,2010-12-01 08:34:00,2.10,13047.0,United Kingdom


In [14]:
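# NOTE: this is an alias, not a copy, so cleanedData2 is modified as well.
cleanedData3 = cleanedData2
# NOTE: pd.Series(new_unit_price) carries a fresh 0..n-1 index, and pandas aligns on
# index labels during assignment, so rows whose label has no match receive NaN.
cleanedData3['UnitPrice'] = pd.Series(new_unit_price)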

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [15]:
pd.isnull(cleanedData['UnitPrice']).sum()

0

In [16]:
cleanedData3.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [17]:
cleanedData.shape

(473353, 8)

In [19]:
cleanedData3.shape

(435801, 8)

In [20]:
mask4 = pd.isnull(cleanedData['UnitPrice'])
cleanedData[~mask4].shape

(473353, 8)

In [21]:
mask4 = pd.isnull(cleanedData3['UnitPrice'])
cleanedData3[~mask4].shape

(347546, 8)

In [22]:
mask3 = pd.isnull(cleanedData3['UnitPrice'])
cleanedData3[mask3]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
435801,574076,23302,KNEELING MAT HOUSEWORK DESIGN,1,2011-11-02 15:38:00,,,United Kingdom
435802,574076,23307,SET OF 60 PANTRY DESIGN CAKE CASES,3,2011-11-02 15:38:00,,,United Kingdom
435803,574076,23308,SET OF 60 VINTAGE LEAF CAKE CASES,1,2011-11-02 15:38:00,,,United Kingdom
435804,574076,23312,VINTAGE CHRISTMAS GIFT SACK,2,2011-11-02 15:38:00,,,United Kingdom
435805,574076,23321,SMALL WHITE HEART OF WICKER,1,2011-11-02 15:38:00,,,United Kingdom
435806,574076,23322,LARGE WHITE HEART OF WICKER,13,2011-11-02 15:38:00,,,United Kingdom
435807,574076,23328,SET 6 SCHOOL MILK BOTTLES IN CRATE,12,2011-11-02 15:38:00,,,United Kingdom
435808,574076,23340,VINTAGE CHRISTMAS CAKE FRILL,1,2011-11-02 15:38:00,,,United Kingdom
435809,574076,23343,JUMBO BAG VINTAGE CHRISTMAS,1,2011-11-02 15:38:00,,,United Kingdom
435810,574076,23344,JUMBO BAG 50'S CHRISTMAS,6,2011-11-02 15:38:00,,,United Kingdom
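
The NaN values above are consistent with index alignment: when a `pd.Series` is assigned to a column, pandas matches it to the rows by index label, not by position. A minimal sketch on toy data (the frame `df` and the array `binned` are hypothetical stand-ins for `cleanedData2` and `new_unit_price`):

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after filtering rows
df = pd.DataFrame({'UnitPrice': [2.55, 3.39, 2.75]}, index=[0, 1, 10])
binned = np.array([2.6, 3.4, 2.6])

# Assigning a Series aligns on labels: label 10 has no match, so it becomes NaN
misaligned = df.copy()
misaligned['UnitPrice'] = pd.Series(binned)

# Assigning the raw array keeps the positional order and leaves no gaps
aligned = df.copy()
aligned['UnitPrice'] = binned

print(misaligned['UnitPrice'].isna().sum())  # 1
print(aligned['UnitPrice'].isna().sum())     # 0
```

Working on an explicit `.copy()` also keeps the source frame unchanged, which is one way to avoid the "copy of a slice" warning.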


In [23]:
cleanedData2.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.6,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.4,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.6,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.4,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.4,17850.0,United Kingdom


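Your explanation: Both attributes, 'Quantity' and 'UnitPrice', have outliers. We remove these outliers by dropping the values above the upper fence and below the lower fence, which we read off the whiskers of the box plots. Only the attribute 'UnitPrice' has noise. We handle the noise by binning the prices into 21 equal-width bins and replacing each price with a representative value of its bin.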
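
The fence rule described above can also be computed directly, without reading the whiskers off a plot. The sketch below uses the common 1.5 × IQR convention, which is also matplotlib's default whisker setting; the helper name `iqr_filter` and the toy data are assumptions made for illustration:

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Keep the rows whose value in `column` lies within the k*IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Hypothetical quantities with one obvious outlier
toy = pd.DataFrame({'Quantity': [1, 2, 3, 4, 5, 6, 7, 8, 200]})
kept = iqr_filter(toy, 'Quantity')
print(kept['Quantity'].tolist())
```

Because the whiskers end at the most extreme data points inside the fences, filtering by the fences keeps the same rows as filtering by the whisker values.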

# Visualization

### Question 4 (Stream graph)
Create a stream graph that visualizes the number of purchases (invoices) per country over time.

4. (a) Modify the data set to only contain purchases made in the countries Belgium, Ireland (EIRE), France, Germany, the Netherlands, Norway, Portugal, Spain and Switzerland.

In [24]:
#your modification

countries = ['Belgium', 'EIRE', 'France', 'Germany', 'Netherlands',
             'Norway', 'Portugal', 'Spain', 'Switzerland']
country_data = cleanedData3[cleanedData3['Country'].isin(countries)]

print(country_data.shape)
country_data.head()

(25532, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
28,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.8,12583.0,France
29,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,1.8,12583.0,France
33,536370,21035,SET/2 RED RETROSPOT TEA TOWELS,18,2010-12-01 08:45:00,2.6,12583.0,France
38,536370,22661,CHARLOTTE BAG DOLLY GIRL DESIGN,20,2010-12-01 08:45:00,1.0,12583.0,France
41,536370,21913,VINTAGE SEASIDE JIGSAW PUZZLES,12,2010-12-01 08:45:00,2.2,12583.0,France


In [25]:
country_data.reset_index(drop=True, inplace=True)

In [26]:
country_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.8,12583.0,France
1,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,1.8,12583.0,France
2,536370,21035,SET/2 RED RETROSPOT TEA TOWELS,18,2010-12-01 08:45:00,2.6,12583.0,France
3,536370,22661,CHARLOTTE BAG DOLLY GIRL DESIGN,20,2010-12-01 08:45:00,1.0,12583.0,France
4,536370,21913,VINTAGE SEASIDE JIGSAW PUZZLES,12,2010-12-01 08:45:00,2.2,12583.0,France
5,536527,22809,SET OF 6 T-LIGHTS SANTA,6,2010-12-01 13:04:00,3.0,12662.0,Germany
6,536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,2010-12-01 13:04:00,1.8,12662.0,Germany
7,536527,84945,MULTI COLOUR SILVER T-LIGHT HOLDER,12,2010-12-01 13:04:00,1.8,12662.0,Germany
8,536527,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,2010-12-01 13:04:00,4.2,12662.0,Germany
9,536527,22244,3 HOOK HANGER MAGIC GARDEN,12,2010-12-01 13:04:00,4.2,12662.0,Germany


4. (b) Modify the data set such that it shows per month for each country how many purchases were made (i.e. how many invoices were created).

In [27]:
#your modification

country_data['InvoiceNo'].unique().shape

(1399,)

In [28]:
country_data.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [29]:
country_data['InvoiceDate'] = country_data['InvoiceDate'].map(lambda x: x.month)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [30]:
pd.isnull(country_data['InvoiceDate']).sum()

0

In [31]:
newdf = country_data.groupby(['Country', 'InvoiceDate']).nunique()
newdf

Unnamed: 0_level_0,Unnamed: 1_level_0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
Country,InvoiceDate,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Belgium,1,3,39,39,7,1,13,3,1
Belgium,2,8,72,72,8,1,16,7,1
Belgium,3,9,101,101,11,1,14,8,1
Belgium,4,6,88,88,11,1,14,6,1
Belgium,5,8,91,91,11,1,14,7,1
Belgium,6,11,149,149,11,1,16,11,1
Belgium,7,5,77,76,8,1,17,4,1
Belgium,8,8,118,119,11,1,18,7,1
Belgium,9,6,117,117,10,1,15,6,1
Belgium,10,11,175,175,10,1,18,9,1


In [32]:
newdf['InvoiceNo'].values

array([ 3,  8,  9,  6,  8, 11,  5,  8,  6, 11,  9,  9,  7, 13, 19, 10, 21,
       23, 22, 17, 33, 27, 40, 26, 25, 21, 23, 11, 35, 29, 22, 24, 44, 28,
       62, 37, 31, 18, 28, 20, 39, 24, 30, 36, 39, 54, 57, 44,  3,  3,  6,
        1,  7, 11,  1,  3,  6,  6,  9,  4,  2,  3,  3,  1,  2,  6,  7,  5,
        5,  4,  2,  4,  2,  1,  3,  3,  1,  3,  7,  7, 10,  7,  4,  7,  3,
        3,  8,  6,  9,  8,  8, 12,  5,  3,  4,  2,  2,  3,  4,  6,  3,  5,
        7,  5,  2])

In [33]:
# Toy example of a stream graph (placeholder data, not the retail data)
# the values for our x-axis; one x value per entry of y1, y2 and y3
x = [1, 2, 3, 4, 5]
# the values that will be stacked on top of each other
y1 = [1, 1, 2, 3, 5]
y2 = [0, 4, 2, 6, 8]
y3 = [1, 3, 5, 7, 9]

# the labels for y1, y2 and y3
labels = ["Fibonacci", "Evens", "Odds"]

# stacking our values vertically (numpy was imported as np)
y = np.vstack([y1, y2, y3])

fig, ax = plt.subplots()
# baseline='wiggle' gives the stacked areas the stream-graph shape
ax.stackplot(x, y, labels=labels, baseline='wiggle')
ax.legend(loc='upper left')
plt.show()

4. (c) Use the modified data to create a stream graph. 

In [None]:
#your code
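
One possible way to get from per-country, per-month invoice counts to a stream graph is to pivot the counts into one column per country and pass them to `stackplot`. This is a sketch on made-up numbers, not the result for the real data; the frame `toy`, its column `Month` and all values are assumptions:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    'Country':   ['France', 'France', 'France', 'Germany', 'Germany', 'Spain'],
    'Month':     [1, 1, 2, 1, 2, 2],
    'InvoiceNo': ['A1', 'A2', 'A3', 'B1', 'B2', 'C1'],
})

# Distinct invoices per (month, country); months become rows, countries columns
counts = (toy.groupby(['Month', 'Country'])['InvoiceNo']
             .nunique()
             .unstack(fill_value=0))

fig, ax = plt.subplots()
# One band per country; baseline='wiggle' gives the stream-graph shape
ax.stackplot(counts.index.to_numpy(), counts.T.to_numpy(),
             labels=list(counts.columns), baseline='wiggle')
ax.set_xlabel('Month')
ax.set_ylabel('Number of invoices')
ax.legend(loc='upper left')
plt.close(fig)
```

Filling missing (month, country) pairs with 0 matters here, because `stackplot` needs one value per country for every x position.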

4. (d) Use this graph to compare the purchases made by each country. 

Your answer:

### Question 5 (Heat map)
Create a heat map that visualizes how much (in sterling) each country purchases per month. 

5. (a) Modify the data set to only contain purchases made in the countries Belgium, Ireland (EIRE), France, Germany, the Netherlands, Norway, Portugal, Spain and Switzerland. (Or use the version of the data set that you created for question 4 a).

In [None]:
#your modification

countries = ['Belgium', 'EIRE', 'France', 'Germany', 'Netherlands',
             'Norway', 'Portugal', 'Spain', 'Switzerland']
country_data = cleanedData3[cleanedData3['Country'].isin(countries)]

country_data.shape

In [None]:
cleanedData3['Country'].unique()

5. (b) Modify the data set such that it shows per month how much money (in sterling) was spent in the shop per country.

In [None]:
#your modification

5. (c) Use the modified data to create a heat map. 

In [None]:
#your code
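
A country-by-month matrix can be drawn as a heat map with matplotlib's `imshow`. The sketch below uses invented amounts; the names `countries`, `months` and `amounts` are placeholders for the real aggregated data:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

countries = ['France', 'Germany']
months = [1, 2, 3]
# Rows follow `countries`, columns follow `months` (made-up values)
amounts = np.array([[5.0, 4.0, 7.5],
                    [6.0, 0.0, 2.0]])

fig, ax = plt.subplots()
image = ax.imshow(amounts, cmap='viridis', aspect='auto')
# Label the cells with the month and country they belong to
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(countries)))
ax.set_yticklabels(countries)
ax.set_xlabel('Month')
fig.colorbar(image, ax=ax, label='Amount (sterling)')
plt.close(fig)
```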

5. (d) Compare the amount of the purchases over time and between each country. 

Your answer:

### Question 6 (Interpretation)
Compare the results obtained from the stream graph and the heat map. Is there a relation between the number of purchases and the amount purchased in sterling?


Your answer:

# Clustering
Presume that the business analyst would like to cluster transactions with similar types of products into the same group (here don’t consider the quantity of the products). For each product, only use its ‘StockCode’ to represent it. All the results here should be based on the preprocessed data set obtained from question 1 to 3 of this assignment. Presume that this obtained data set from question 1 to 3 has a variable name ‘cluster_dataset’ and is expressed by Pandas DataFrame in your code.

### Question 7 (Data transformation and clustering)
7. (a) Calculate and show the number of occurrences of each product in the data set ‘cluster_dataset’. For example, if a product appears in a transaction, its occurrence count is increased by 1 (do not consider the quantity of this product here). Keep the 100 most frequent products and remove all other products from ‘cluster_dataset’: if a row in ‘cluster_dataset’ contains a product that is not among these 100, remove this row. Show the new ‘cluster_dataset’ in your result.


In [None]:
# your code
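
As an illustration of the counting and filtering described in 7 (a), here is a sketch on a toy frame that keeps the 2 most frequent products instead of 100. The frame `toy` and the variable `top_n` are assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '2', '3', '3'],
    'StockCode': ['A', 'B', 'A', 'C', 'A', 'B'],
})

top_n = 2  # the assignment asks for 100
# Count each product once per invoice, regardless of quantity or repeated rows
occurrences = (toy.drop_duplicates(['InvoiceNo', 'StockCode'])['StockCode']
                  .value_counts())
keep = occurrences.head(top_n).index
# Drop every row whose product is not among the most frequent ones
filtered = toy[toy['StockCode'].isin(keep)]
print(occurrences)
print(filtered)
```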

7. (b) Based on question a, please reorganize the data from ‘cluster_dataset’ and generate a new data set ‘cluster_dataset_new’ which has a suitable format (for k-means) for solving the transaction clustering problem mentioned above. Show the data from ‘cluster_dataset_new’ by using Pandas DataFrame in your result, where the index should be consistent with the values of 'InvoiceNo', the column name should be consistent with the values of 'StockCode' and each element in this DataFrame should have a value 0 or 1.

In [None]:
# your code
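
One way to build the 0/1 invoice-by-product table that 7 (b) describes is `pd.crosstab` followed by clipping the counts at 1. A small sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '1', '2', '3'],
    'StockCode': ['A', 'A', 'B', 'B', 'A'],
})

# crosstab counts rows per (invoice, product); clip(upper=1) turns counts into 0/1
basket = pd.crosstab(toy['InvoiceNo'], toy['StockCode']).clip(upper=1)
print(basket)
```

The index of `basket` holds the invoice numbers and the columns hold the stock codes, which matches the layout requested in the question.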

7. (c) Try values 2, 3, 4, 5 for parameter 'n_clusters' for the k-means function from Scikit-Learn over the data set ‘cluster_dataset_new’ generated in question b. Show the ‘within cluster variation’ (also called ‘sum of squared distances’) of the generated clusters for each different setting for ‘n_clusters’ in your result. Also write down the value that you have tried for setting 'n_clusters' which can help generate the best clustering results and explain how you make this decision.

In [None]:
# your code
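
In scikit-learn, the within-cluster variation of a fitted `KMeans` model is available as its `inertia_` attribute. The sketch below loops over the four values of `n_clusters` on a random 0/1 matrix; the matrix `X` is a hypothetical stand-in for ‘cluster_dataset_new’, and `n_init` and `random_state` are choices made only for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical binary transaction matrix: 40 transactions, 6 products
rng = np.random.default_rng(0)
X = (rng.random((40, 6)) < 0.4).astype(int)

inertias = {}
for k in [2, 3, 4, 5]:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centre
    inertias[k] = model.inertia_
    print(k, round(inertias[k], 3))
```

Inertia usually keeps falling as `n_clusters` grows, so a common heuristic is to look for the "elbow" where the decrease flattens rather than to pick the smallest value.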

Your explanation:

# Frequent Itemsets and Association Rules
For the clusters output by the k-means function with the best 'n_clusters' from question 7, the business analyst would now like to investigate the frequent purchase behaviours and specific purchase rules for each cluster.
### Question 8 (Data transformation and mining frequent itemsets and association rules)
8. (a) Set the minimum support for finding the frequent purchase behaviours to 0.2. Please provide the business analyst with the qualified purchase behaviours. For each product, only use its ‘StockCode’ to represent it. Also show the data set prepared for each cluster for mining the frequent behaviours by using Pandas DataFrame in your result, the data set for the cluster k should have the variable name 'fpb_data_k' in your code.

In [None]:
# your code
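
To make the notion of minimum support concrete, here is a brute-force sketch in plain Python that enumerates every itemset and keeps those whose support reaches the threshold. It is only meant to illustrate the definition: the enumeration is exponential in the number of items, so real data calls for an Apriori or FP-growth implementation. The toy `transactions` and the threshold 0.4 are assumptions:

```python
from itertools import combinations

transactions = [
    {'A', 'B'},
    {'A', 'B', 'C'},
    {'A', 'C'},
    {'B'},
    {'A', 'B'},
]
min_support = 0.4  # the assignment uses 0.2

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(set(candidate))
        if s >= min_support:
            frequent[candidate] = s

print(frequent)
```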

8. (b) Furthermore, the business analyst would like to analyze the purchase behaviour of the citizens from ‘United Kingdom’ for each cluster. Specifically, he wants to discover whether there are rules which indicate that the citizens from ‘United Kingdom’ tend to buy some specific products in each cluster. Set the minimum support to 0.2 and the minimum confidence to 0.7. Please discover and show such rules (only show the rules with ‘United Kingdom’ appearing in the antecedents) for each cluster for the business analyst. Also show the data sets prepared for each cluster for mining the relevant rules by using Pandas DataFrame in your result; the data set for cluster k should have the variable name 'r_data_k' in your code.

In [None]:
# your code
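
One way to obtain rules with ‘United Kingdom’ in the antecedent is to treat the country as an additional item of each transaction. The sketch below checks single-product consequents by hand, using confidence(X → Y) = support(X ∪ Y) / support(X); the toy `transactions` and the product list are assumptions:

```python
transactions = [
    {'United Kingdom', 'A', 'B'},
    {'United Kingdom', 'A'},
    {'United Kingdom', 'A', 'C'},
    {'France', 'B'},
    {'United Kingdom', 'B'},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """conf(X -> Y) = support(X and Y together) / support(X)."""
    return support(antecedent | consequent) / support(antecedent)

min_support, min_confidence = 0.2, 0.7
antecedent = {'United Kingdom'}
rules = {}
for product in ['A', 'B', 'C']:
    joint = support(antecedent | {product})
    conf = confidence(antecedent, {product})
    if joint >= min_support and conf >= min_confidence:
        rules[product] = (joint, conf)

print(rules)  # only the rule {'United Kingdom'} -> {'A'} passes both thresholds
```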

# Text Mining
### Question 12 (Model based on binary document-term matrix)
Perform preprocessing on the corpus (all lowercase, no punctuation, tokenization, stemming, stopword removal) and obtain a binary document-term matrix; train a logistic classifier.

In [None]:
# nltk's default stoplist:
from nltk.corpus import stopwords
stoplist = set(stopwords.words('english'))

# your code
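
A minimal sketch of a binary document-term matrix feeding a logistic classifier, using scikit-learn's `CountVectorizer` with `binary=True`. The toy corpus, the labels and the short `toy_stoplist` are assumptions, and stemming is left out here to keep the example free of extra dependencies:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical corpus; the real documents and labels come from the assignment data
docs = ['The film is good, good fun', 'A great and good plot',
        'The film is bad', 'A boring, bad plot']
labels = [1, 1, 0, 0]
toy_stoplist = ['the', 'is', 'a', 'and']

# Lowercasing and punctuation stripping are CountVectorizer defaults;
# binary=True records presence/absence instead of counts
vectorizer = CountVectorizer(binary=True, stop_words=toy_stoplist)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)
print(sorted(vectorizer.vocabulary_))
print(clf.predict(vectorizer.transform(['good and great'])))
```

Leaving out `binary=True` gives a matrix of raw counts instead, so the same pipeline can be reused for a count-based model.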

### Question 13 (Model based on document-term matrix of counts)
Perform preprocessing on the corpus (all lowercase, no punctuation, tokenization, stemming, stopword removal) and obtain a document-term matrix of counts; train a logistic classifier.


In [None]:
# your code

### Question 14 (Model based on tf-idf document-term matrix)
Perform preprocessing on the corpus (all lowercase, no punctuation, tokenization, stemming, stopword removal) and obtain a tf-idf scores document-term matrix; train a logistic classifier.


In [None]:
# your code
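
To show what a tf-idf score measures, the following sketch computes the textbook variant (raw term count times log(N / df)) by hand on already tokenized documents. Library implementations often use smoothed or normalized variants, so their numbers may differ; the toy `docs` are assumptions:

```python
import math
from collections import Counter

# Hypothetical, already preprocessed documents
docs = [['good', 'film', 'good'], ['bad', 'film'], ['good', 'plot']]

n_docs = len(docs)
# Document frequency: number of documents that contain the term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Textbook variant: raw term count times log(N / df)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}

for doc in docs:
    print(tf_idf(doc))
```

Terms that occur in many documents (here 'good' and 'film') get a smaller idf factor than terms that occur in only one.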

### Question 15 (Model based on doc2vec)
Perform preprocessing on the corpus (all lowercase, no punctuation, tokenization, stemming, stopword removal) and obtain a doc2vec embedding in order to reduce the dimension of the document vector to 300; use the doc2vec model you just trained to convert the training set to a set of document vectors; train a logistic classifier.


In [None]:
# your code

### Question 16 (Evaluation)
16. (a) Predict the classification with the four models on the test data.


In [None]:
# your code

16. (b) Obtain confusion matrices for the four different models.


In [None]:
# your code

16. (c) Obtain accuracy and f1 score for the four different models.


In [None]:
# your code
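
For reference, accuracy and F1 follow directly from the four cells of a binary confusion matrix. A plain-Python sketch with invented labels (`y_true` and `y_pred` are hypothetical):

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print([[tn, fp], [fn, tp]])  # rows: actual class, columns: predicted class
print(accuracy, f1)
```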

16. (d) Briefly comment on the quality of the predictions for the four models.

_Your comment:_


# Process Mining
For this part, refer to the online docs of pm4py. The documentation on filtering is of particular interest (https://pm4py.github.io/filtering.html, or on the new website http://pm4py.pads.rwth-aachen.de/documentation/filtering-logs/). 
Important: if you did not do so in the instruction, make sure you have the latest pm4py version. To get it, type `pip install pm4py --upgrade` in any terminal emulator on Windows (command prompt, PowerShell, etc.) or any terminal on *nix systems. For the details, refer to the study guide and the Process Mining instruction.
### Question 17 (Trace frequency)
17. (a) Use the provided event log and identify the least frequent traces and the most frequent traces.


In [None]:
# your code
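
Independently of any process mining library, trace frequency boils down to counting identical activity sequences. A sketch in plain Python, where the event log is modelled as a hypothetical list of traces:

```python
from collections import Counter

# Hypothetical event log: one list of activity names per case
log = [
    ['register', 'check', 'pay'],
    ['register', 'check', 'pay'],
    ['register', 'pay'],
    ['register', 'check', 'reject'],
    ['register', 'check', 'pay'],
    ['register', 'pay'],
]

# Traces are converted to tuples so they can be used as dictionary keys
variants = Counter(tuple(trace) for trace in log)
ranked = variants.most_common()  # sorted from most to least frequent
print('most frequent:', ranked[0])
print('least frequent:', ranked[-1])
```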

### Question 18 (Process Discovery and Conformance Checking using first filtered event log)
18. (a) Remove the two least frequent traces and create a new event log out of the original event log without the two least frequent traces.

In [None]:
# your code
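
The filtering step can be sketched the same way: determine the two variants with the lowest counts and keep only the traces that belong to other variants. The toy `log` is an assumption, and with a real log the same idea would be expressed through the library's filtering functions:

```python
from collections import Counter

# Hypothetical log with four variants of frequency 4, 3, 2 and 1
log = ([['a', 'b', 'c']] * 4 + [['a', 'c']] * 3
       + [['a', 'b', 'b', 'c']] * 2 + [['a', 'd']])

variants = Counter(tuple(trace) for trace in log)
# The last two entries of most_common() are the two least frequent variants
least_two = {variant for variant, _ in variants.most_common()[-2:]}
filtered_log = [trace for trace in log if tuple(trace) not in least_two]
print(len(log), len(filtered_log))
```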

18. (b) Use the Inductive Miner algorithm to discover the process model based on your new event log (the filtered log without the two least frequent traces).


In [None]:
# your code

18. (c) Do the token replay conformance checking using your discovered model and the original event log. Does your process model fit?


In [None]:
# your code

Your explanation:

18. (ci) Calculate the fitness of your model.

In [None]:
# your code

18. (cii) Are there any deviations between the process model and the event log?

Your explanation:

### Question 19 (Process Discovery and Conformance Checking using second filtered event log)
19. (a) Now use the original event log and remove the two most frequent traces, and discover the model based on your new event log (the filtered log without the two most frequent traces).


In [None]:
# your code

19. (b) Do the token replay conformance checking using your newly discovered model and the original event log. Does your process model fit?

In [None]:
# your code

Your explanation:

19. (bi) Calculate the fitness of your model.

In [None]:
# your code

19. (bii) Is there any deviation inside the process model?

Your explanation:

### Question 20 (Process Discovery using complete log)
20. (a) Use the complete event log (original event log) and discover your process model using inductive miner.


In [None]:
# your code

20. (b) Do the token replay conformance checking using your newly discovered model and the original event log. Does your process model fit?

In [None]:
# your code

Your explanation:

20. (c) How are these three discovered process models different from each other? Which model is the best fitting to the original log? Why?

Your explanation: