# Homework 3 - Databases and SQL
In this guide, we will connect to the SQLite database created in the lecture, fill it with values, and then run a few queries. You will be using the data files from the previous homework.

### Instructions
1. Follow the instructions on how to set up your Python and Jupyter (or VSCode) environment and how to clone or download our repository. Instructions can be found in the class notes.
2. Fill in the missing pieces of code in the provided notebook.
3. Answer the questions in the notebook through code.
4. Run the notebook and make sure everything works.


### Dataset Overview
We will use two datasets for this assignment. The first one is the same as used in HW1, which consists of four text files, each containing a story. Files are in the `Datasets` directory of this repository. The stories are:

- `story-1.txt`: The Monkey and the Crocodile
- `story-2.txt`: The Musical Donkey
- `story-3.txt`: A Tale of Three Fish
- `story-4.txt`: The Foolish Lion and the Clever Rabbit

The second dataset contains information about soccer players in SQLite format. The file is called `fifa_soccer_dataset.sqlite.gz` and is located in the `Datasets` directory of this repository.

**IMPORTANT** The database is compressed and needs to be decompressed before use. You can do this by running the following command in your terminal on Linux or macOS:

```bash
gunzip Datasets/fifa_soccer_dataset.sqlite.gz
```

If you are using Windows, you can run the following commands in PowerShell:

```powershell
$sourceFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite.gz"
$destinationFile = "$PWD\Datasets\fifa_soccer_dataset.sqlite"

$inputStream = [System.IO.File]::OpenRead($sourceFile)
$outputStream = [System.IO.File]::Create($destinationFile)
$gzipStream = New-Object System.IO.Compression.GzipStream($inputStream, [System.IO.Compression.CompressionMode]::Decompress)
$gzipStream.CopyTo($outputStream)

$gzipStream.Close()
$outputStream.Close()
$inputStream.Close()
```

Alternatively, you can extract the file using the GUI of your operating system.
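
Another option that should work the same way on any operating system is to decompress the file from Python. The sketch below uses only the standard library (`gzip`, `shutil`, `pathlib`); the helper name `decompress_gzip` is made up for this illustration and is not part of the course code. Unlike `gunzip`, it leaves the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(source: str, destination: str) -> Path:
    """Decompress the gzip file at `source` into `destination` and return the output path."""
    src = Path(source)
    dst = Path(destination)
    # Copy in chunks so a large database is not loaded into memory all at once.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


# Example call (paths as used in this guide; adjust them to where you run the code):
# decompress_gzip("Datasets/fifa_soccer_dataset.sqlite.gz",
#                 "Datasets/fifa_soccer_dataset.sqlite")
```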


### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command:

- Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on Mac).
- Search for `Jupyter: Export to HTML`.
- Save the HTML file to your computer and submit it via Canvas.

---

### Part 1 - Story Analysis
First, we need to import the correct library to use SQLite functions. Can you figure out which library we are going to use?

In [28]:
import #Input needed Library here
import os

Now that we have our functions ready to go, let's set the data path and connect to our database. For the `dbpath` variable, choose a path that creates a database file named `mydb.sqlite`.

In [None]:
data_path = 'Datasets' # Path to the datasets folder
dbpath = # Select a path to save the database file
conn = sqlite3.connect(dbpath) 

Now that we are connected, we can create our cursor variable using the connection's `cursor()` method.

In [30]:
cur = conn._()
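
As a quick sanity check of the connect-then-cursor pattern, here is a minimal sketch that runs against a throwaway in-memory database rather than `mydb.sqlite`. The names `demo_conn` and `demo_cur` are chosen for this illustration only, so they do not clash with the notebook's `conn` and `cur`.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the connection closes.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# A trivial query to confirm that the connection and cursor work.
demo_cur.execute("SELECT sqlite_version()")
version = demo_cur.fetchone()[0]
print("SQLite version:", version)

demo_conn.close()
```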

Using `cur`, we want to add our new tables. However, if we try to create a table that already exists, we will receive an error. So, we first need to `DROP` the tables if they already exist.<br>Write your `DROP` queries below; they are passed as arguments to the `execute` method. Do this twice, once for the `stories` table and once for the `word_counts` table.

In [None]:
drop_stories_query  = # Input DROP query for stories table here
drop_info_query = # Input DROP query for word_counts table here

#Now we run both queries
cur.execute(drop_stories_query)
cur.execute(drop_info_query)
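
To see why dropping first matters, the sketch below reproduces the error on a made-up table called `pets` in an in-memory database. It is an illustration of the general pattern under those assumptions, not the queries for this assignment.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# `pets` is a hypothetical table used only for this illustration.
# IF EXISTS makes the DROP harmless when the table is not there yet.
demo_cur.execute("DROP TABLE IF EXISTS pets")
demo_cur.execute("CREATE TABLE pets (pet_id INTEGER PRIMARY KEY, name TEXT)")

try:
    # Creating a table that already exists raises an error.
    demo_cur.execute("CREATE TABLE pets (pet_id INTEGER PRIMARY KEY, name TEXT)")
    second_create_failed = False
except sqlite3.OperationalError as err:
    second_create_failed = True
    print("Second CREATE failed:", err)

# Dropping again is safe, and leaves a clean slate.
demo_cur.execute("DROP TABLE IF EXISTS pets")
remaining = demo_cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
demo_conn.close()
```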

We now have a clean slate set up to create our new tables! Now, design the queries to create new tables for `stories` and `word_counts`. <br>
The `stories` table only needs a primary key `story_id` that is an integer and a text field called `story` to store the corresponding story text.
<br>
The `word_counts` table will need a little more. Please include the following:
- `word_id`: primary key, integer
- `word`: text
- `count`: integer
- `story_id`: integer, foreign key to `stories` table

In [40]:
create_stories_query = # Insert CREATE TABLE query for stories table here
create_info_query =  # Insert CREATE TABLE query for word_counts table here
cur.execute(create_stories_query)
cur.execute(create_info_query)
conn.commit()
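
For reference, here is a small sketch of how a primary key and a foreign key can be declared in SQLite, using two made-up tables (`authors` and `books`) instead of the assignment's tables. One detail worth knowing: SQLite only enforces foreign keys on a connection after `PRAGMA foreign_keys = ON` has been run; the assignment does not require this, so treat it as optional.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default and must be enabled per connection.
demo_conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical parent table.
demo_conn.execute("""
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT
    )
""")
# Hypothetical child table that points back to authors.
demo_conn.execute("""
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER,
        FOREIGN KEY (author_id) REFERENCES authors (author_id)
    )
""")

demo_conn.execute("INSERT INTO authors (name) VALUES ('Example Author')")
demo_conn.execute("INSERT INTO books (title, author_id) VALUES ('Valid', 1)")

try:
    # author_id 99 does not exist, so the foreign key check rejects this row.
    demo_conn.execute("INSERT INTO books (title, author_id) VALUES ('Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

book_count = demo_conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
demo_conn.close()
```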

With our new empty tables ready, we can now loop through the stories and store the word counts.<br> This is similar to how we looped through the stories in HW1, with the additional step of inserting the data.<br>
In the cell below, please add the two queries for inserting these rows. <br><br>
The first insert stores the story text in the `stories` table while reading from the file. The second insert runs after counting all the words and stores those values in the `word_counts` table. Remember that these query string variables are passed into the `execute()` method.

In [None]:
# No leading slash: os.path.join discards data_path if the file name starts with "/"
stories = ["story-1.txt", "story-2.txt", "story-3.txt", "story-4.txt"]

for story in stories:
    words = []
    count_of_each_word = {}
    story_id = ""
    try:
        # Open the file
        story_path = os.path.join(data_path, story)
        with open(story_path,"r", encoding='utf-8') as fp:
            # reading data from file and splitting into words
            # and storing them in a list
            story_text = fp.read()
            
            # For the query below, you will need to use a '?' to
            #     mark where you want the story_text to be inserted.
            #     The actual text in story_text is passed in a tuple as
            #     the second parameter of the execute() method.
            
            insert_story_query = # Input INSERT query for the stories table here
            cur.execute(insert_story_query, (story_text,))
            
            #Grabbing the last id inserted, so we can use it when inserting values into the word_counts table
            story_id = cur.lastrowid
            conn.commit()
            words = story_text.split()
            
            # No explicit close is needed: the with block closes the file automatically
            
    except Exception as e:
        print("Unable to open the file: " + str(e))

    # Just like before, we are iterating over each word and using a dictionary to store the word counts
    for word in words:
        if word in count_of_each_word:
            count_of_each_word[word] += 1
        else:
            count_of_each_word[word] = 1

    for key in count_of_each_word:
        insert_count_query = # Input query here, using '?' again in the VALUES () portion
        
        cur.execute(insert_count_query, (key, count_of_each_word[key], story_id))
        conn.commit()
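
The `?` placeholder and `lastrowid` mechanics used above can be tried in isolation. The sketch below uses two made-up tables (`notes` and `tokens`) in an in-memory database, so it is not the assignment's schema. It also shows two optional alternatives that the notebook does not require: `collections.Counter` for counting words, and `executemany` for inserting many rows with a single call.

```python
import sqlite3
from collections import Counter

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
# Hypothetical tables for this illustration only.
demo_cur.execute("CREATE TABLE notes (note_id INTEGER PRIMARY KEY, body TEXT)")
demo_cur.execute(
    "CREATE TABLE tokens (token_id INTEGER PRIMARY KEY, token TEXT, n INTEGER, note_id INTEGER)"
)

text = "the cat saw the dog"

# Each '?' is filled from the tuple, in order; sqlite3 takes care of quoting.
demo_cur.execute("INSERT INTO notes (body) VALUES (?)", (text,))
note_id = demo_cur.lastrowid  # id generated for the row just inserted

# Counter builds the same word -> count mapping as a manual dictionary loop.
counts = Counter(text.split())

# executemany runs the statement once for every tuple in the list.
demo_cur.executemany(
    "INSERT INTO tokens (token, n, note_id) VALUES (?, ?, ?)",
    [(token, n, note_id) for token, n in counts.items()],
)
demo_conn.commit()

rows = demo_cur.execute("SELECT token, n FROM tokens ORDER BY token").fetchall()
print(rows)
demo_conn.close()
```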


Finally! Our tables are filled, and we can now run SELECT queries against them to pull the data we want. There are two queries you will need to run.
### SELECT Query One: 
Grab all rows from `word_counts` where the word is "the" and the count is greater than 1.

In [None]:
query_one = #Input SELECT query one here
cur.execute(query_one)
records = cur.fetchall()
for record in records:
    print(record)
    


### SELECT Query Two: 
Grab the `story_id`, `story`, and `count` columns for the rows where the word is "the". You should use a JOIN statement for this query, and you only need to include `story_id` from one of the two tables.

In [None]:
query_two = #Input SELECT query two here
cur.execute(query_two)
records = cur.fetchall()
for record in records:
    print(record)
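
If JOIN syntax is new to you, the following sketch shows the general shape on two made-up tables (`authors` and `books`) in an in-memory database. The table names, columns, and data are assumptions for illustration; the point is how the `ON` clause links the tables and why a column that exists in both tables needs a table prefix.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows for this illustration only.
demo_conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, pages INTEGER, author_id INTEGER);
    INSERT INTO authors (author_id, name) VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO books (title, pages, author_id) VALUES
        ('First', 120, 1), ('Second', 300, 1), ('Third', 250, 2);
""")

# author_id exists in both tables, so it is qualified with a table alias.
join_query = """
    SELECT a.author_id, a.name, b.title
    FROM authors AS a
    JOIN books AS b ON b.author_id = a.author_id
    WHERE b.pages > ?
    ORDER BY b.title
"""
joined = demo_conn.execute(join_query, (200,)).fetchall()
for row in joined:
    print(row)
demo_conn.close()
```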

### Part 2 - Soccer Database
Now that we have our first database filled with data, we can move on to the second one. We will be using the `fifa_soccer_dataset.sqlite` file.
Feel free to use either sqlite3 or pandas to run your queries!
If you plan to use pandas, check the `pandas` documentation for how to read from a SQLite database. In particular, the `read_sql_query` function runs a SQL query on a connection and returns the result as a pandas DataFrame.


In [None]:
dataset_path = "../../Datasets/fifa_soccer_dataset.sqlite" # Fix your path accordingly

import sqlite3
conn = sqlite3.connect(dataset_path)


If you are using pandas, import it and read in the database. For instance:

In [None]:
import pandas as pd
# get all tables
df_tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", conn)
display(df_tables)
# get all players from the Player table
df_players = pd.read_sql_query("SELECT * FROM Player", conn)
display(df_players)


If you prefer to use just sqlite3, you can do that as well. Just make sure to import the library and connect to the database:

In [None]:
conn = sqlite3.connect(dataset_path)
cur = conn.cursor()
# get all tables
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cur.fetchall()
for table in tables:
    print(table)
# get all players from the Player table
cur.execute("SELECT * FROM Player")
players = cur.fetchall()
for player in players:
    print(player)
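
Before writing the queries for the questions below, it can help to check which columns a table has. The sketch below shows two ways to do that with plain sqlite3, using a made-up `gadgets` table in an in-memory database; with the real file you would connect to `dataset_path` and substitute a table name from the list printed above. Adding `LIMIT` to exploratory queries keeps the output short on large tables.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical table for this illustration only.
demo_conn.execute(
    "CREATE TABLE gadgets (gadget_id INTEGER PRIMARY KEY, label TEXT, price REAL)"
)
demo_conn.execute("INSERT INTO gadgets (label, price) VALUES ('widget', 9.5)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk). Index 1 is the column name.
columns = [row[1] for row in demo_conn.execute("PRAGMA table_info(gadgets)")]
print(columns)

# sqlite3.Row lets you read values by column name instead of by position.
demo_conn.row_factory = sqlite3.Row
first = demo_conn.execute("SELECT * FROM gadgets LIMIT 1").fetchone()
label = first["label"]
print(label)
demo_conn.close()
```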

**QUESTION 1**

Print the birthday of the player whose name is “Aaron Kuhl”. *Hint: Use ‘Player’ table*

In [None]:
# YOUR CODE HERE

**QUESTION 2**

Print the number of times the team_fifa_api_id ‘673’ appears in the Team_attribute table. *Hint: Apply a GROUP BY clause on the team_fifa_api_id attribute*


In [None]:
# YOUR CODE HERE

**QUESTION 3**

Print the country name and league name for the matches played on “2014-04-20 00:00:00”. *Hint: Join the Match table with the Country table, then the Match table with the League table*


In [None]:
# YOUR CODE HERE