# <center>Week 4 Assignment</center>

This week you will be using the MovieLens 1 million ratings dataset. By the time you are finished with this assignment, you will have another SQLite database and NoSQL database to use in other classes or for projects.

Broadly, this assignment will follow the FTE's progression:

* Load MovieLens tables into SQLite (good time to find that "multiple insert")
* Create a query to retrieve reviews into a cursor
* Create a dataclass that represents a movie review
* Translate rows of the cursor into a list of MovieReview objects
* Translate the list of MovieReviews into a list of dictionaries
* Load the list of dictionaries into TinyDB (using `insert_multiple()`)
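
The dataclass steps in the list above can be sketched roughly as follows, assuming each joined row arrives as a dictionary keyed by column name. The `MovieReview` field names and the sample row are illustrative assumptions, not a required design.

```python
from dataclasses import dataclass, asdict


@dataclass
class MovieReview:
    # Hypothetical field names, chosen to mirror the README column names.
    Title: str
    Genres: str
    Gender: str
    Age: str
    Occupation: str
    ZipCode: str
    Rating: str
    Timestamp: str


# One made-up row in the shape a joined query might return.
rows = [
    {"Title": "Toy Story (1995)", "Genres": "Animation|Children's|Comedy",
     "Gender": "F", "Age": "1", "Occupation": "10", "ZipCode": "48067",
     "Rating": "5", "Timestamp": "978824268"},
]

# Rows -> MovieReview objects -> plain dictionaries.
reviews = [MovieReview(**row) for row in rows]
review_dicts = [asdict(review) for review in reviews]
```

`review_dicts` ends up as a list of plain dictionaries, which is the shape TinyDB's `insert_multiple()` expects.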

There is one important point that will need to be addressed:

* MovieLens consists of 3 tables:
    * Users
    * Movies
    * Ratings (the reviews)

One complete review consists of data from all three tables joined together. We will work through that part together. 

<hr>

## Part 1 - Storing in SQLite

In this part, you are expected to read MovieLens's README file to find information to proceed. 

<div class="alert alert-block alert-info">
<b>Hint:</b> Jupyter Notebook and JupyterLab can open it.
</div>

In [2]:
import dataset

In [3]:
# Fill in between the quotes for your own system
sql_db_path = "./data/"

In [4]:
# Fill in the connection string between the parentheses
db = dataset.connect('sqlite:///movie.db')

In [5]:
# Are these files comma-separated?
separator = '::'

In [6]:
# Get column names from the README
# Replace *'s with column names
users_head = "UserID::Gender::Age::Occupation::ZipCode".split(separator)
movies_head = "MovieID::Title::Genres".split(separator)
ratings_head = "UserID::MovieID::Rating::Timestamp".split(separator)

Before executing the next line, stop and think about what the output should be. Does the actual output match your expectation?

In [7]:
users_head

['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode']

Now it is time to create the database tables. As mentioned, there will be three of them. Interestingly, the `USERS` table and the `MOVIES` table both have unique ID fields already - we will have to take that into account. The `RATINGS` table, on the other hand, does not have a unique ID column, so we don't have to worry about it. 

The general, simple format to create a table is:

`table_variable = db.create_table("table_name")` (this is what you use for ratings).

But in the case where the data already has an ID, we have to tell Dataset about it. The general form is:

`table_variable = db.create_table("table_name", primary_id="ID_column_name", primary_type=db.types.integer)`

So, in the case of the `MOVIES` table, the `MovieID` column is the primary key.

In [8]:
try:
    ratings_table = db.create_table("ratings")
except:
    print('This table already exists')

In [9]:
try:
    users_table = db.create_table(
        "user", primary_id="UserID", primary_type=db.types.integer)
except:
    print("Table already exists")

In [10]:
try:
    movies_table = db.create_table(
        "movies", primary_id="MovieID", primary_type=db.types.integer)
except:
    print('Table already exists')

You can, and probably should, put those `create_table()` function calls in `try / except` blocks.

Let's set up variables for the data file names:

In [11]:
users_file = "./data/users.dat"
movies_file = "./data/movies.dat"
ratings_file = "./data/ratings.dat"

OK. Here it is, the moment you've all been waiting for -- we can start stuffing data in the tables we created. 

But, before we do the first, consider these questions and write the answers below:

**Having the ID column in the data caused one difference in our table creation (vs. Week_2).**

1) Do you notice any other differences, and if so, what are they? <br>
- Entering the primary keys into the table schemas.

2) If there is a difference, why is it different? <br>
- The data already comes with primary keys, so each table has its own unique list of keys that we don't have to keep track of ourselves. A primary key is defined as a column (or set of columns) where each value is unique and identifies a single row of the table.

3) If there is a difference, how does it affect the data retrieved with a SELECT statement? <br>
- It will enable us to join tables and create more detailed queries.

<hr>

OK... I kind of lied a little bit. There is one more thing to show you about the insert. You might remember that this data set is called **"ml-1m"**, which stands for _MovieLens - 1 million rows_. In the grand scheme of modern data storage, 1 million rows isn't a huge number, but it **is** enough to make even a fast laptop like mine choke a bit, so we are going to use a technique that many RDBMSs call **Bulk Insert.**

Bulk insert is optimized for inserting large amounts of similarly-structured data. SQLite is relatively fast, so let's do a quick comparison using the users table. After that, it will be **up to you to populate the other 2 tables.** We will also use that progress bar from the FTE, just for fun.
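
To make the difference in call pattern concrete, here is a minimal sketch using Python's standard `sqlite3` module rather than Dataset, with an in-memory database and made-up rows. Dataset's `insert()` and `insert_many()` play the corresponding roles; the two-column table is a simplification assumed only for this example.

```python
import sqlite3

# Made-up rows standing in for parsed lines of users.dat.
rows = [(i, "F" if i % 2 else "M") for i in range(1, 1001)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user (UserID INTEGER PRIMARY KEY, Gender TEXT)")

# Row by row: one execute() call per record.
for row in rows[:500]:
    con.execute("INSERT INTO user VALUES (?, ?)", row)

# Bulk: hand the remaining records over in a single executemany() call.
con.executemany("INSERT INTO user VALUES (?, ?)", rows[500:])
con.commit()

row_count = con.execute("SELECT COUNT(*) FROM user").fetchone()[0]
```

Both halves end up in the same table; the bulk version simply lets the library handle the whole batch at once instead of making one call per row.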

In [12]:
%%time
with open(users_file) as ufile:
    for line in ufile:
        u_dict = dict(zip(users_head, line.split("::")))
        users_table.insert(u_dict)

IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: user.UserID
[SQL: INSERT INTO user ("UserID", "Gender", "Age", "Occupation", "ZipCode") VALUES (?, ?, ?, ?, ?)]
[parameters: ('1', 'F', '1', '10', '48067\n')]
(Background on this error at: http://sqlalche.me/e/gkpj)
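
A `UNIQUE constraint failed` error on a primary key column generally means a row with that ID is already in the table, for example because the table was populated on an earlier run (an assumption about what happened here; the output does not say). A minimal sketch of the same failure, using the standard `sqlite3` module and a made-up two-column table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user (UserID INTEGER PRIMARY KEY, Gender TEXT)")
con.execute("INSERT INTO user VALUES (?, ?)", (1, "F"))

try:
    # Same primary key again: SQLite rejects the second row.
    con.execute("INSERT INTO user VALUES (?, ?)", (1, "F"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = con.execute("SELECT COUNT(*) FROM user").fetchone()[0]
```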

In [13]:
# Drop the table before trying to insert again
# You might remember how to do this from Week 2

# HINT: you need the table name, and the drop command...

users_table.drop()
movies_table.drop()
ratings_table.drop()

In [14]:
%%time
users_list = []
with open(users_file) as ufile:
    for line in ufile:
        users_list.append(dict(zip(users_head, line.split("::"))))
users_table.insert_many(users_list)


CPU times: user 153 ms, sys: 13.3 ms, total: 166 ms
Wall time: 183 ms


<hr>

Now **YOU** can decide how you want to do the other two tables, using `insert()` or `insert_many()`.

Since there are only 2 of them, I will let you do them one by one. _Don't get used to it!_

In [None]:
# insert or insert_many movies_file here
# I had an odd error in this file, the line below gets around it
# with open(movies_file, errors="ignore") as mfile:

In [17]:
# insert or insert_many ratings_file here

In [18]:
movie_list = []
with open(movies_file, encoding="ISO-8859-1") as mfile:
    for line in mfile:
        movie_list.append(dict(zip(movies_head, line.split("::"))))

movies_table.insert_many(movie_list)


In [19]:
rating_list = []
with open(ratings_file, encoding="ISO-8859-1") as rfile:
    for line in rfile:
        rating_list.append(dict(zip(ratings_head, line.split("::"))))

ratings_table.insert_many(rating_list)


<div class="alert alert-success">
  <strong>Success!</strong> At this point you should have a working relational database containing the MovieLens data!
</div>

<hr>

### SQL Joins

Records are divided into multiple tables due to the process of **data normalization**. We have to **join tables** in our `SELECT` queries to get one full movie rating.

There are several types of join. The query below uses an **inner join**, which returns only the rows whose keys match in both tables. The other type you will see most often is the **left join** (also called a left outer join), which additionally keeps rows from the left table that have no match. The *left* part refers to the layout if you were putting the printed tables side by side on your desk: the left table is the one named first in the query. In both cases you are matching foreign keys in one table to their primary keys in the other. Let's look at an example:

<center>Movie</center>

| MovieID | Title | Genres |
|---------|-------|--------|
| 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 5 | Father of the Bride Part II (1995) | Comedy |

<center>Users</center>

| UserID | Gender | Age | Occupation | ZipCode |
|--------|--------|-----|------------|---------|
| 1 | F | 1 | 10 | 48067 |
| 2 | M | 56 | 16 | 70072 |
| 3 | M | 25 | 15 | 55117 |
| 4 | M | 45 | 7 | 02460 |
| 5 | M | 25 | 20 | 55455 |


<center>Ratings</center>

| UserID | MovieID | Rating | Timestamp|
|--------|---------|--------|----------|
| 1 | 1193 | 5 | 978300760|
| 1 | 661 | 3 | 978302109|
| 1 | 914 | 3 | 978301968|
| 1 | 3408 | 4 | 978300275|
| 1 | 2355 | 5 | 978824291|

It should be obvious in this small example that Ratings are linked to both Movie and Users through their IDs. So, to get a complete rating record, we need the Movie record where the MovieIDs match and the user where the UserIDs match. In SQL that looks like this:

SQL keywords are in caps.

```
SELECT m.Title, m.Genres, u.Gender, u.Age, u.Occupation, u.ZipCode, r.Rating, r.Timestamp
FROM movies m
INNER JOIN ratings r ON m.MovieID = r.MovieID
INNER JOIN user u ON r.UserID = u.UserID
ORDER BY m.Title ASC;
```

Normally, when referencing columns from multiple tables, you have to prefix the column name with the table name, but in this case I used a shortcut -- in the FROM part, I gave each table a one-letter alias. 

Also notice the last line. The `ORDER BY` clause sorts the results alphabetically by title, which also puts all the ratings for the same movie together.
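
To see the join mechanics on something small enough to check by hand, here is a sketch using the standard `sqlite3` module with an in-memory database. The tables are cut down to a few columns and the ratings are made up so that every rating has a matching movie and user; none of this is the real MovieLens data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE movies (MovieID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE user (UserID INTEGER PRIMARY KEY, Gender TEXT);
CREATE TABLE ratings (UserID INTEGER, MovieID INTEGER, Rating INTEGER);
INSERT INTO movies VALUES (1, 'Toy Story (1995)'), (2, 'Jumanji (1995)');
INSERT INTO user VALUES (1, 'F'), (2, 'M');
-- Made-up ratings: movie 2 is rated twice, movie 1 once.
INSERT INTO ratings VALUES (1, 2, 4), (2, 2, 3), (2, 1, 5);
""")

# Each rating row picks up its movie title and its user's gender.
joined = con.execute("""
    SELECT m.Title, u.Gender, r.Rating
    FROM movies m
    INNER JOIN ratings r ON m.MovieID = r.MovieID
    INNER JOIN user u ON r.UserID = u.UserID
    ORDER BY m.Title ASC, r.Rating ASC
""").fetchall()
```

Each of the three rating rows comes back once, with both Jumanji ratings listed before the Toy Story rating because of the `ORDER BY`.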

Let's try it and see what comes out.

In [21]:
db = dataset.connect("sqlite:///movie.db")

# con = sqlite3.connect("data/emp.db")

In [38]:
# Put the query in here. NOTE: If you break up the lines, you need 
# a "continuation character" at the end of the line. 

movie_query = "SELECT m.title, m.genres, u.Gender, u.Age, u.Occupation, u.ZipCode, r.Rating, r.Timestamp \
FROM movies m \
INNER JOIN ratings r ON m.MovieID = r.MovieID \
INNER JOIN user u ON r.UserID = u.UserID \
ORDER BY m.Title ASC;"


test_query = 'select * from movies'



In [70]:
# Add the command to execute a query. 
# Reference: https://dataset.readthedocs.io/en/latest/api.html#dataset.Database.query
query_result = db.query(movie_query)


In [72]:
# Convert that result into a list for ease of use.
movie_list = []
for movie in query_result:
    movie_list.append(movie)

# Print out first movie to see what is stored in the list
movie_list[0]

OrderedDict([('Title', '$1,000,000 Duck (1971)'),
             ('Genres', "Children's|Comedy\n"),
             ('Gender', 'F'),
             ('Age', '35'),
             ('Occupation', '0'),
             ('ZipCode', '17870\n'),
             ('Rating', '5'),
             ('Timestamp', '976215651\n')])
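
The `\n` at the end of several values comes from splitting each line of the `.dat` files without first removing its line ending, so the last column of every table carries a newline. One possible way to avoid it (a sketch, not what the cells above do) is to strip the line ending before splitting. The sample line is rebuilt from the parameters shown in the error output earlier in this notebook.

```python
users_head = ["UserID", "Gender", "Age", "Occupation", "ZipCode"]
line = "1::F::1::10::48067\n"  # one raw line as read from users.dat

# Remove the trailing newline before splitting on the separator.
fields = line.rstrip("\n").split("::")
u_dict = dict(zip(users_head, fields))
```

Note that the values are still strings; converting numeric columns with `int()` would be a separate step.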

## Part 2 - Storing in TinyDB

Hopefully you remember that TinyDB inserts dictionaries as documents. This means that the data in the `movie_list` variable is in the correct form to insert. 

In [34]:
!pip install tinydb

Collecting tinydb
  Downloading https://files.pythonhosted.org/packages/9b/83/2d46115b89640e9b85b94df47216547396e94125245dd3ade186036ce976/tinydb-3.15.1-py2.py3-none-any.whl
Installing collected packages: tinydb
Successfully installed tinydb-3.15.1


In [44]:
!pip install tqdm



In [79]:
from tinydb import TinyDB, Query, where
from tqdm import tqdm
from dataclasses import asdict
import json

tiny_db = TinyDB("ml_nosql.json")

In [90]:
# TinyDB's bulk insert method is insert_multiple(), not multiple_insert().
# The rows are already dictionary-like, not dataclass instances, so asdict()
# is not needed; a JSON round trip converts each row to a plain dictionary.
output_dict = json.loads(json.dumps(movie_list))
tiny_db.insert_multiple(output_dict)

<div class="alert alert-success">
  <strong>Success!</strong> At this point you should have a working NoSQL database containing the MovieLens data!
</div>

Now we can actually start using this data. 

SQL has some aggregation functions that can be interesting. For example, to find an average of a numeric column:

`select avg(column) from table where condition;`
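
As a small illustration of `avg()`, the sketch below uses the standard `sqlite3` module with a made-up, single-table stand-in for the joined data. It adds a `GROUP BY` clause, which is an extra not used elsewhere in this notebook, to get one average per gender in a single query. The ratings are stored as text, as they are in this notebook; SQLite converts numeric-looking text when averaging.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ratings (Gender TEXT, Rating TEXT);
-- Made-up ratings, stored as text like the values in this notebook.
INSERT INTO ratings VALUES ('F', '4'), ('F', '5'), ('M', '3'), ('M', '4'), ('M', '5');
""")

# One average per gender in a single query.
averages = dict(con.execute(
    "SELECT Gender, AVG(Rating) FROM ratings GROUP BY Gender"
).fetchall())
```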

<div class="alert alert-info">
  <strong>Note:</strong> At this point, I'm not sure that the Dataset library gains us anything, since we are just passing straight SQL through it. You can continue to use Dataset or switch to the SQLite3 library. I'll stay with Dataset, since it is already loaded. 
</div>

We can modify our join from above to get an average rating from women for the movie "Die Hard" like this:

In [None]:
movie_query = "select m.title, u.Gender, avg(r.Rating) \
from movies m \
inner join ratings r on m.MovieID = r.MovieID \
inner join user u on r.UserID = u.UserID \
where u.Gender = 'F' and m.title = 'Die Hard (1988)';"

In [None]:
query_result = db.query(movie_query)

In [None]:
# A quick little list comprehension to extract the results
f_avg = [row for row in query_result]

f_avg

In [None]:
# So, to print it nicely:
print(f"Average female rating for {f_avg[0]['Title']} is {f_avg[0]['avg(r.Rating)']}")

That process is slightly more manual in TinyDB. Here, we can use TinyDB's `where()` command along with `matches()` to find movies with the right title, then use a logical and `&` to limit it to women. We can also take advantage of Python's built in `sum()` and `len()` commands to help us out.

It sounds more complicated than it is. Like this:


In [None]:
female_dh_set = tiny_db.search( (where('Title').matches('Die Hard')) & (where('Gender').matches('F')) )

That gives us a list of dictionaries. Prove that to yourself if you need to.

The rest is simple (Remember all numbers are stored as strings!):

In [None]:
dh_avg_f = sum(int(r['Rating']) for r in female_dh_set) / len(female_dh_set)

In [None]:
print(f'Average: {dh_avg_f}')
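
One thing worth checking when comparing the two databases: as far as I can tell, TinyDB's `matches()` behaves like Python's `re.match`, which anchors the pattern at the start of the string only. Under that assumption, the pattern `'Die Hard'` would also match any other title that begins with those words, whereas the SQL query compares against the exact title. The sketch below mimics this with plain Python on made-up records; `average_rating` is a hypothetical helper, not part of TinyDB.

```python
import re

# Made-up records in the shape stored in TinyDB (numbers are strings).
records = [
    {"Title": "Die Hard (1988)", "Gender": "F", "Rating": "5"},
    {"Title": "Die Hard (1988)", "Gender": "F", "Rating": "4"},
    {"Title": "Die Hard 2 (1990)", "Gender": "F", "Rating": "1"},
]


def average_rating(pattern, rows):
    """Average the ratings of rows whose Title matches pattern at its start."""
    hits = [int(r["Rating"]) for r in rows if re.match(pattern, r["Title"])]
    return sum(hits) / len(hits)


loose = average_rating("Die Hard", records)            # also picks up the sequel
exact = average_rating(r"Die Hard \(1988\)", records)  # parentheses escaped
```

With these sample records the two averages differ, which is the kind of gap to look for if the SQL and TinyDB results disagree.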

## Questions:

1. Using the relational database you built, compare M and F average ratings for "Die Hard."
2. Do the same comparison with the NoSQL database.
3. Do the averages match?
4. What is the age range of female reviewers of "Gone With The Wind?" (Hint: in SQL, you can use a column more than once. Hint 2: There may be built in functions that help.)