# **Workshop #2**

### *Load of raw data - `the_grammy_awards` dataset*
---

In this notebook, we are working with the `the_grammy_awards` dataset. The workflow includes setting up the project directory, importing necessary dependencies, loading raw data from a CSV file into a Pandas DataFrame, and uploading the data to a PostgreSQL database for further analysis.

With this workflow, we demonstrate the process of data ingestion and initial exploration, setting the stage for more detailed analysis and insights.

## ***Setting the project directory***

This cell tries to change the current working directory to the project root (`../../Workshop #2`, relative to the notebook's location).
If that path does not exist, `os.chdir` raises `FileNotFoundError`; the cell then assumes the notebook is already running from the project root and prints a message saying so.

In [1]:
import os

try:
    os.chdir("../../Workshop #2")
except FileNotFoundError:
    print("You are already in the correct directory.")
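
The `try`/`except` above treats any missing path as "already in the right place", which can hide a genuinely wrong working directory. As a hedged alternative, the sketch below walks up from a starting folder until it finds a marker directory. The helper name `find_project_root` and the choice of `src` as the marker are assumptions made for illustration; they are not part of the workshop code.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: `marker` defaults to "src" only because this notebook
    imports from a `src` package; adjust it to match the real project layout.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found at or above {start}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` would either move to the detected root or fail loudly instead of silently assuming the directory is correct.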

## ***Importing dependencies***

**Modules:**
* **src.database.db_operations**: Custom module for database operations.

**For this environment we are using:**
* ***Pandas*** >= 2.2.2

**Through the `src.database.db_operations` module, we are also using:**
* ***SQLAlchemy*** >= 2.0.32
    * *SQLAlchemy Utils* >= 0.41.2
* ***python-dotenv*** >= 1.0.1

In [2]:
from src.database.db_operations import creating_engine, load_raw_data

import pandas as pd

## ***Loading the raw data***

### **Turning the CSV into a Pandas DataFrame object**

In this code block, data is loaded from a CSV file located at `data/raw/the_grammy_awards.csv` using the `pandas` library. The `pd.read_csv` function reads the file and converts it into a DataFrame. 

Subsequently, the `head()` method of the DataFrame is used to display the first few rows of the dataset, providing a quick and preliminary view of the CSV file's content. This initial visualization is useful for verifying that the data has been loaded correctly and for getting a general idea of the dataset's structure and content.

In [3]:
df = pd.read_csv("data/raw/the_grammy_awards.csv")
df.head()

| | year | title | published_at | updated_at | category | nominee | artist | workers | img | winner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 62nd Annual GRAMMY Awards (2019) | 2020-05-19T05:10:28-07:00 | 2020-05-19T05:10:28-07:00 | Record Of The Year | Bad Guy | Billie Eilish | Finneas O'Connell, producer; Rob Kinelski & Fi... | https://www.grammy.com/sites/com/files/styles/... | True |
| 1 | 2019 | 62nd Annual GRAMMY Awards (2019) | 2020-05-19T05:10:28-07:00 | 2020-05-19T05:10:28-07:00 | Record Of The Year | Hey, Ma | Bon Iver | BJ Burton, Brad Cook, Chris Messina & Justin V... | https://www.grammy.com/sites/com/files/styles/... | True |
| 2 | 2019 | 62nd Annual GRAMMY Awards (2019) | 2020-05-19T05:10:28-07:00 | 2020-05-19T05:10:28-07:00 | Record Of The Year | 7 rings | Ariana Grande | Charles Anderson, Tommy Brown, Michael Foster ... | https://www.grammy.com/sites/com/files/styles/... | True |
| 3 | 2019 | 62nd Annual GRAMMY Awards (2019) | 2020-05-19T05:10:28-07:00 | 2020-05-19T05:10:28-07:00 | Record Of The Year | Hard Place | H.E.R. | Rodney “Darkchild” Jerkins, producer; Joseph H... | https://www.grammy.com/sites/com/files/styles/... | True |
| 4 | 2019 | 62nd Annual GRAMMY Awards (2019) | 2020-05-19T05:10:28-07:00 | 2020-05-19T05:10:28-07:00 | Record Of The Year | Talk | Khalid | Disclosure & Denis Kosiak, producers; Ingmar C... | https://www.grammy.com/sites/com/files/styles/... | True |


Additionally, the `info()` method of the DataFrame is used to provide a concise summary of the dataset. This summary includes the number of non-null entries, the data type of each column, and the memory usage of the DataFrame. The `info()` method is particularly useful for understanding the overall structure of the dataset, identifying missing values, and ensuring that the data types are as expected.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4810 entries, 0 to 4809
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          4810 non-null   int64 
 1   title         4810 non-null   object
 2   published_at  4810 non-null   object
 3   updated_at    4810 non-null   object
 4   category      4810 non-null   object
 5   nominee       4804 non-null   object
 6   artist        2970 non-null   object
 7   workers       2620 non-null   object
 8   img           3443 non-null   object
 9   winner        4810 non-null   bool  
dtypes: bool(1), int64(1), object(8)
memory usage: 343.0+ KB
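
Reading the summary above against the 4810 total rows, `nominee` is missing 6 values, `artist` 1840, `workers` 2190, and `img` 1367, while `published_at` and `updated_at` are stored as `object` (plain strings) rather than timestamps. The sketch below shows how both points could be checked directly. It uses a tiny hand-made DataFrame with placeholder values, not rows from the real dataset, and the variable names are illustrative only.

```python
import pandas as pd

# Tiny hand-made sample that mimics a few columns of the dataset (placeholder values).
sample = pd.DataFrame(
    {
        "nominee": ["Song A", None, "Song C"],
        "artist": ["Artist A", None, None],
        "published_at": ["2020-05-19T05:10:28-07:00"] * 3,
    }
)

# Missing values per column: the complement of the "Non-Null Count" shown by info().
missing = sample.isna().sum()

# Parse the ISO 8601 strings into timezone-aware timestamps, normalized to UTC.
published = pd.to_datetime(sample["published_at"], utc=True)

print(missing)
print(published.dtype)
```

Passing `utc=True` is one possible choice here: it converts the `-07:00` offset to UTC so that every value in the column shares a single time zone.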


### **Uploading the data to PostgreSQL**

In this code block, the function `load_raw_data` is used to load the raw data into the PostgreSQL database. It is called with three arguments:

1. `engine`: The SQLAlchemy engine returned by `creating_engine()`, which manages the connection to the database where the data will be loaded.
2. `df`: The DataFrame containing the raw data that was read from the CSV file.
3. `"grammy_awards_raw"`: The name of the database table where the raw data will be stored.

By calling `load_raw_data(engine, df, "grammy_awards_raw")`, the DataFrame `df` is loaded into the specified table in the database, allowing for further processing and analysis within a structured database environment.

In [5]:
engine = creating_engine()

09/14/2024 02:04:49 PM Engine created. You can now connect to the database.
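
The implementation of `creating_engine` is not shown in this notebook. Since the dependency list mentions python-dotenv and SQLAlchemy, one plausible step inside it is assembling a connection URL from environment variables. The following is a minimal sketch of that step only, using the standard library. The function name `build_postgres_url` and the variable names `PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, and `PG_DATABASE` are hypothetical and may differ from the project's actual `.env` file.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def build_postgres_url(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a PostgreSQL connection URL from environment-style settings.

    All variable names below are assumptions made for this sketch.
    """
    # Percent-encode credentials so characters such as "@" or "/" cannot break the URL.
    user = quote_plus(env["PG_USER"])
    password = quote_plus(env["PG_PASSWORD"])
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env["PG_DATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

A string of this form is the kind of database URL that `sqlalchemy.create_engine` accepts; whether the project builds it this way is an assumption.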


In [6]:
load_raw_data(engine, df, "grammy_awards_raw")

09/14/2024 02:04:49 PM Creating table grammy_awards_raw from Pandas DataFrame
09/14/2024 02:04:50 PM Table grammy_awards_raw created successfully.
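
The body of `load_raw_data` is likewise not shown here. A common way to create a table from a DataFrame is `DataFrame.to_sql`, and the sketch below illustrates that pattern under two stated assumptions: the helper `load_table` is hypothetical, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a server. The two sample rows reuse values visible in the `head()` output above.

```python
import sqlite3

import pandas as pd


def load_table(connection, frame: pd.DataFrame, table_name: str) -> int:
    """Write `frame` to `table_name`, replacing any existing table; return the row count.

    Hypothetical stand-in for the project's `load_raw_data`, not its real code.
    """
    frame.to_sql(table_name, connection, if_exists="replace", index=False)
    return len(frame)


# In-memory SQLite is used only so the sketch is self-contained.
connection = sqlite3.connect(":memory:")
sample = pd.DataFrame(
    {
        "year": [2019, 2019],
        "nominee": ["Bad Guy", "Hey, Ma"],
        "winner": [True, True],
    }
)

rows_written = load_table(connection, sample, "grammy_awards_raw")

# Read the row count back to confirm the table was created and populated.
stored = pd.read_sql_query("SELECT COUNT(*) AS n FROM grammy_awards_raw", connection)
print(rows_written, int(stored["n"].iloc[0]))
```

Using `if_exists="replace"` makes the load repeatable, at the cost of dropping any existing table with the same name; `"append"` or `"fail"` may suit other workflows better.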
