# Base

## No alterations to createContentTrainingData.py
I built the training data by writing it to a file, shuffling the lines, and then lowercasing everything with `sed` and `tr`.
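
The shuffle-and-lowercase step was done with shell tools, but the same preparation can be sketched in Python. This is a minimal illustration, not the commands actually used; the function name and the `seed` parameter are hypothetical and only there to make the shuffle repeatable.

```python
import random


def shuffle_and_lowercase(lines, seed=0):
    """Return the given lines lowercased and in shuffled order.

    `seed` is a hypothetical addition so the shuffle is repeatable.
    """
    prepared = [line.lower() for line in lines]
    random.Random(seed).shuffle(prepared)
    return prepared


if __name__ == "__main__":
    sample = ["__label__a Red Shoes", "__label__b Blue Hat", "__label__a Green Socks"]
    for line in shuffle_and_lowercase(sample):
        print(line)
```

Lowercasing the whole line is safe here because the `__label__` prefix is already lowercase; label names themselves would be lowercased too.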
For the first run, I trained fasttext with its default parameters. Testing gave the following results for top 1:
```
N       9604
P@1     0.124
R@1     0.124
```
and for top 5:
```
N       9604
P@5     0.0463
R@5     0.232
```
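
In these results P@1 equals R@1, and P@5 is roughly one fifth of R@5. That pattern is what you would expect when every test example has exactly one label and the model always returns k predictions. The sketch below illustrates the arithmetic under those two assumptions; it is not fasttext's own code, and the function and argument names are hypothetical.

```python
def precision_recall_at_k(gold_labels, ranked_predictions, k):
    """Compute P@k and R@k for single-label examples.

    gold_labels: one true label per example.
    ranked_predictions: for each example, predicted labels, best first.
    Assumes every example yields at least k predictions.
    """
    hits = sum(
        1
        for gold, preds in zip(gold_labels, ranked_predictions)
        if gold in preds[:k]
    )
    n = len(gold_labels)
    precision = hits / (n * k)  # correct predictions / all predictions made
    recall = hits / n           # correct predictions / all true labels
    return precision, recall


if __name__ == "__main__":
    gold = ["a", "b", "c", "d"]
    preds = [["a", "x"], ["y", "b"], ["z", "w"], ["d", "q"]]
    print(precision_recall_at_k(gold, preds, 1))
    print(precision_recall_at_k(gold, preds, 2))
```

Under these assumptions, P@k is always R@k divided by k, so R@5 is the more informative of the two top 5 numbers.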

After raising the learning rate to 1, training for 25 epochs, and setting word n-grams (`wordNgrams`) to 2, I got much better results for top 1:
```
N       9604
P@1     0.612
R@1     0.612
```
and for top 5:
```
N       9604
P@5     0.162
R@5     0.808
```
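
For reference, the tuned run can be expressed as fasttext command lines. The sketch below only builds the argument lists. The `-lr`, `-epoch`, and `-wordNgrams` flags and the `supervised` and `test` subcommands are part of the fasttext CLI, but the executable name, file names, and helper names are assumptions made for illustration.

```python
def train_command(train_file, model_prefix, lr=1.0, epoch=25, word_ngrams=2):
    """Build the argument list for `fasttext supervised` with the tuned settings."""
    return [
        "fasttext", "supervised",
        "-input", train_file,
        "-output", model_prefix,
        "-lr", str(lr),
        "-epoch", str(epoch),
        "-wordNgrams", str(word_ngrams),
    ]


def eval_command(model_prefix, test_file, k=1):
    """Build the argument list for `fasttext test`, reporting P@k and R@k."""
    return ["fasttext", "test", model_prefix + ".bin", test_file, str(k)]


if __name__ == "__main__":
    # Hypothetical file names; pass the lists to subprocess.run(..., check=True) to execute.
    print(" ".join(train_command("train.txt", "model")))
    print(" ".join(eval_command("model", "test.txt", k=5)))
```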

## Lowercasing all letters and removing punctuation
With the default fasttext parameters, I got the following for top 1:
```
N       9677
P@1     0.118
R@1     0.118
```
and for top 5:
```
N       9677
P@5     0.042
R@5     0.21
```
After changing the fasttext parameters, I got the following for top 1:
```
N       9677
P@1     0.62
R@1     0.62
```
and for top 5:
```
N       9677
P@5     0.163
R@5     0.815
```
The first test already lowercased the text with `sed` and `tr`, so the only new step here was removing punctuation. I am not surprised that the results are similar.
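
A minimal sketch of this kind of normalization follows. It assumes "punctuation" means the ASCII characters in `string.punctuation`, which may differ from what the script actually strips; the helper names are hypothetical. The label is kept apart from the text so the underscores in `__label__` are not removed.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text, delete punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def normalize_example(label, text):
    """Format one training line, normalizing only the text part."""
    return f"__label__{label} {normalize(text)}"


if __name__ == "__main__":
    print(normalize_example("phones", "Apple iPhone 6s, 16GB (Gold)!"))
```

Deleting punctuation outright joins hyphenated words ("wi-fi" becomes "wifi"); replacing it with a space instead would split them into separate tokens.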

## Using the NLTK Snowball Stemmer
With the default fasttext parameters, I got the following for top 1:
```
N       9676
P@1     0.127
R@1     0.127
```
and for top 5:
```
N       9676
P@5     0.0431
R@5     0.216
```
After changing the fasttext parameters, I got the following for top 1:
```
N       9676
P@1     0.616
R@1     0.616
```
and for top 5:
```
N       9676
P@5     0.158
R@5     0.788
```
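
Stemming each token before writing the training file could look like the sketch below. It assumes whitespace tokenization and the English Snowball stemmer, neither of which is confirmed by these notes; the helper names are hypothetical. The NLTK import is deferred so the token-level helper can be used with any stemming function.

```python
def make_snowball_stem(language="english"):
    """Return the `stem` method of an NLTK Snowball stemmer (requires nltk)."""
    from nltk.stem.snowball import SnowballStemmer  # imported lazily
    return SnowballStemmer(language).stem


def stem_text(text, stem):
    """Apply `stem` to every whitespace-separated token and rejoin with spaces."""
    return " ".join(stem(token) for token in text.split())


# Example usage, assuming nltk is installed:
#   stem = make_snowball_stem()
#   stem_text("running shoes", stem)
```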

# Testing for min_products

```python
import pandas as pd

# Toy frame: category 'a' has two products, 'b' and 'c' have one each
data = {'category': ['a', 'b', 'c', 'a'], 'product': [1, 2, 3, 4]}
df = pd.DataFrame(data=data)

df['category']
```
```
0    a
1    b
2    c
3    a
Name: category, dtype: object
```

```python
# This is the winner! It removes every row whose category has x or fewer
# records (x being 1 here), keeping only categories with more than x records.
categories = df['category'].value_counts()[lambda x: x <= 1].index
df.drop(df['category'].isin(categories)[lambda x: x].index, inplace=True)
df.apply(lambda x: f"__label__{x['category']} {x['product']}", axis=1).tolist()
```
```
['__label__a 1', '__label__a 4']
```

Three alternatives that select the same rows:

```python
df.groupby('category').filter(lambda x: len(x) > 1)['category']
```
```
0    a
3    a
Name: category, dtype: object
```

```python
df[df.groupby("category")["category"].transform('size') > 1]['category']
```
```
0    a
3    a
Name: category, dtype: object
```

```python
df[df['category'].map(df['category'].value_counts()) > 1]['category']
```
```
0    a
3    a
Name: category, dtype: object
```
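
The same filter can be sketched without pandas, which makes the threshold explicit. This assumes `min_products` means "keep a category only if it has at least that many products", so `min_products = 2` matches the `> 1` condition above; the function name and the `(category, product)` tuple layout are hypothetical.

```python
from collections import Counter


def filter_min_products(rows, min_products):
    """Return fasttext-style lines for rows whose category is frequent enough.

    rows: iterable of (category, product) pairs.
    A category is kept when it appears at least `min_products` times.
    """
    rows = list(rows)
    counts = Counter(category for category, _ in rows)
    return [
        f"__label__{category} {product}"
        for category, product in rows
        if counts[category] >= min_products
    ]


if __name__ == "__main__":
    toy = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
    print(filter_min_products(toy, 2))
```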

# Minimum Products

## min_products = 2
### Default Parameters
Top 1
```
N       9682
P@1     0.118
R@1     0.118
```
Top 5
```
N       9682
P@5     0.0432
R@5     0.216
```
### Finer Training
Top 1
```
N       9682
P@1     0.612
R@1     0.612
```
Top 5
```
N       9682
P@5     0.162
R@5     0.808
```

## min_products = 5
### Default Parameters
Top 1
```
N       9698
P@1     0.128
R@1     0.128
```
Top 5
```
N       9698
P@5     0.0449
R@5     0.224
```
### Finer Training
Top 1
```
N       9698
P@1     0.609
R@1     0.609
```
Top 5
```
N       9698
P@5     0.161
R@5     0.806
```

## min_products = 10
### Default Parameters
Top 1
```
N       9811
P@1     0.123
R@1     0.123
```
Top 5
```
N       9811
P@5     0.0451
R@5     0.226
```
### Finer Training
Top 1
```
N       9811
P@1     0.627
R@1     0.627
```
Top 5
```
N       9811
P@5     0.166
R@5     0.83
```