# Initial data exploration

In this notebook I load the data, extract useful insights from it, and use them to create scripts for data preprocessing and model training.

> Note: download the data before running this notebook (please refer to the corresponding [README](../data/README.md)).

## Loading the data

In [1]:
from pprint import pprint
import pandas as pd

Reading the genres from the corresponding file:

In [2]:
genres = []

with open("../data/raw/ml-100k/u.genre") as genres_file:
    for genre in genres_file.readlines():
        # check for emptiness
        if genre.strip():
            genres.append(genre.split("|")[0])

pprint(genres)
assert len(genres) == 19, "number of genres including unknown should be 19"

['unknown',
 'Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']
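As an alternative to the manual loop, the same list could probably be read with pandas directly. The sketch below runs on a small in-memory sample rather than the real file, and it assumes each non-empty line has the form `name|id`; the column names `genre` and `genre_id` are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for u.genre (assumed layout: "name|id", with a trailing blank line).
sample = "unknown|0\nAction|1\nAdventure|2\n\n"

# read_csv skips blank lines by default, so no explicit emptiness check is needed.
genres_df = pd.read_csv(
    io.StringIO(sample),
    sep="|",
    header=None,
    names=["genre", "genre_id"],
)
sample_genres = genres_df["genre"].tolist()
print(sample_genres)
```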


Creating dataframes

> I created a small helper to quickly write out column names

`user id | age | gender | occupation | zip code`

`->`

`["user_id", "age", "gender", "occupation", "zip_code"]`

```python
lambda x: "[" + ", ".join([f'"{el.strip().replace(" ", "_")}"' for el in x.split("|")]) + "]"
```
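The one-liner above can also be written as a named function, which makes it easier to reuse and check. This is only a sketch; the name `to_column_names` is hypothetical and not part of the original notebook.

```python
def to_column_names(header: str) -> str:
    """Turn 'user id | age' into the text of a Python list: '["user_id", "age"]'."""
    # Split on the pipe, trim whitespace, and replace inner spaces with underscores.
    names = [part.strip().replace(" ", "_") for part in header.split("|")]
    # Quote each name and wrap the result in brackets so it can be pasted into code.
    return "[" + ", ".join(f'"{name}"' for name in names) + "]"


columns_text = to_column_names("user id | age | gender | occupation | zip code")
print(columns_text)
```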

In [3]:
users_df = pd.read_csv(
    "../data/raw/ml-100k/u.user",
    sep="|",
    header=None,
    names=["user_id", "age", "gender", "occupation", "zip_code"],
    index_col="user_id",
)
items_df = pd.read_csv(
    "../data/raw/ml-100k/u.item",
    sep="|",
    # there was some problem with utf-8, solved according to the stackoverflow answer
    # https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
    encoding="latin-1",
    header=None,
    names=[
        "movie_id",
        "movie_title",
        "release_date",
        "video_release_date",
        "IMDb_URL",
    ]
    + genres,
    index_col="movie_id",
)
users_to_items_df = pd.read_csv(
    "../data/raw/ml-100k/u.data",
    sep="\t",
    header=None,
    names=["user_id", "item_id", "rating", "timestamp"],
)
# for viewing pleasure
users_to_items_df["timestamp"] = pd.to_datetime(
    users_to_items_df["timestamp"],
    unit="s",
)
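
As a quick illustration of the `unit="s"` conversion used above: it treats the integers in the `timestamp` column as seconds since the Unix epoch and turns them into readable datetimes. The value below is an assumed sample chosen for illustration, not something read from the file here.

```python
import pandas as pd

# One epoch-seconds value (assumed sample for illustration).
raw_timestamps = pd.Series([881250949])

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC.
converted = pd.to_datetime(raw_timestamps, unit="s")
print(converted.iloc[0])
```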

In [4]:
users_df.head()

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [5]:
items_df.head()

movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [6]:
users_to_items_df.head()

,user_id,item_id,rating,timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


Now, let's join the data together

In [7]:
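# NOTE: without `on=`, join() aligns on the row index of users_to_items_df,
# not on the user_id / item_id columns
joined_df = users_to_items_df.join(other=users_df).join(other=items_df)
joined_df.head()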

,user_id,item_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,1997-12-04 15:55:49,,,,,,,...,,,,,,,,,,
1,186,302,3,1998-04-04 19:22:22,24.0,M,technician,85711.0,Toy Story (1995),01-Jan-1995,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22,377,1,1997-11-07 07:18:36,53.0,F,other,94043.0,GoldenEye (1995),01-Jan-1995,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,244,51,2,1997-11-27 05:02:03,23.0,M,writer,32067.0,Four Rooms (1995),01-Jan-1995,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,166,346,1,1998-02-02 05:33:16,24.0,M,technician,43537.0,Get Shorty (1995),01-Jan-1995,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
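
Note that `DataFrame.join` without `on=` matches the left frame's row index against the right frame's index. That is why, in the output above, row 0 has no user or movie data and row 1 (user 186, item 302) shows user 1 and movie 1. Below is a minimal sketch of a key-based join on made-up toy frames (not the real data), assuming the intent is to match on `user_id` and `item_id`:

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are made up for illustration).
ratings = pd.DataFrame(
    {"user_id": [2, 1, 2], "item_id": [10, 20, 20], "rating": [3, 5, 4]}
)
users = pd.DataFrame({"user_id": [1, 2], "age": [24, 53]}).set_index("user_id")
items = pd.DataFrame(
    {"movie_id": [10, 20], "movie_title": ["A", "B"]}
).set_index("movie_id")

# `on=` names the left-hand column that is matched against the right frame's index.
joined = ratings.join(users, on="user_id").join(items, on="item_id")

# Every rating should now have found its user and its movie.
assert joined[["age", "movie_title"]].notna().all().all()
print(joined)
```

With the notebook's frames, the equivalent call would presumably be `users_to_items_df.join(users_df, on="user_id").join(items_df, on="item_id")`.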


## Exploration

In [16]:
users_to_items_df = pd.read_csv(
    "../data/raw/ml-100k/ua.test",
    sep="\t",
    header=None,
    names=["user_id", "item_id", "rating", "timestamp"],
)
# for viewing pleasure
users_to_items_df["timestamp"] = pd.to_datetime(
    users_to_items_df["timestamp"],
    unit="s",
)

In [28]:
# keep all ratings of users who have at least 3 ratings above 3
users_to_items_df.groupby("user_id").filter(lambda x: (x["rating"] > 3).sum() >= 3)

,user_id,item_id,rating,timestamp
0,1,20,4,1998-02-14 04:51:23
1,1,33,4,1997-11-03 07:38:19
2,1,61,4,1997-11-03 07:33:40
3,1,117,3,1997-09-22 22:02:19
4,1,155,2,1997-11-03 07:30:01
...,...,...,...,...
9425,943,232,4,1998-02-28 04:24:27
9426,943,356,4,1998-02-28 04:19:58
9427,943,570,1,1998-02-28 04:28:45
9428,943,808,4,1998-02-28 04:24:28
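
`groupby(...).filter(...)` keeps or drops whole groups, so every rating of a qualifying user is retained, including ratings of 3 or lower (as with user 1 above). A small sketch on made-up data, not the real ratings:

```python
import pandas as pd

# Made-up ratings: user 1 has three ratings above 3, user 2 has only two.
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2, 2],
        "rating": [4, 5, 4, 2, 5, 4, 1],
    }
)

# The lambda is evaluated once per user; a True result keeps all of that user's rows.
kept = toy_ratings.groupby("user_id").filter(
    lambda group: (group["rating"] > 3).sum() >= 3
)

# User 1 survives with all four rows (including the rating of 2); user 2 is dropped.
print(kept)
```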
