# Assessment 1 (20 marks)

## Data Modeling (8 marks)

In this assessment you implement an algorithm that reads a CSV data dump from Reddit and creates a database (relational or non-relational) that models the different entities and the relationships between them. With this database in place, you are also asked to implement queries for generating reports about the dataset.

### Starter code for loading the csv file and connecting to database

In [1]:
data_path = 'data_portfolio_21.csv'

Connect to the school's MySQL server using your credentials.

In [2]:
import pymysql
import credentials

password = credentials.MYSQL_PASSWORD
# Connect to the database
connection = pymysql.connect(host=credentials.HOST_NAME,
                             user=credentials.MYSQL_USERNAME,
                             password=password,
                             db=credentials.DB_NAME,
                             charset='utf8mb4')


---

### SQL and Python code for creating the tables [3 marks]

In [3]:
# your code here
try:
    with connection.cursor() as cur:
        q="""ALTER DATABASE c2075016_covid_reddit CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"""
        cur.execute(q)
        connection.commit()
        
        q="""DROP TABLE IF EXISTS favourites"""
        cur.execute(q)
        connection.commit()
        
        q="""DROP TABLE IF EXISTS posts"""
        cur.execute(q)
        connection.commit()
        
        q="""DROP TABLE IF EXISTS users"""
        cur.execute(q)
        connection.commit()
        
        q="""DROP TABLE IF EXISTS subreddits"""
        cur.execute(q)
        connection.commit()
        
        q= """CREATE TABLE subreddits (
        subr_ID INT AUTO_INCREMENT NOT NULL,
        subr_name VARCHAR(64) NOT NULL,
        subr_created_at DATE NOT NULL,
        subr_description VARCHAR(4096),
        subr_numb_members INT UNSIGNED NOT NULL,
        subr_numb_posts INT UNSIGNED NOT NULL,
        CONSTRAINT subreddits_PK PRIMARY KEY (subr_ID)
        ) DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;"""
# I chose to use DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci for every table in order to allow non-ASCII 
# characters in the database such as emojis and special characters. I chose to do this rather than cleaning the 
# text data to remove non-ASCII characters in order to avoid the data loss that this would of necessity involve.
        cur.execute(q)
        connection.commit()
        
        q="""CREATE TABLE users(
        user_ID INT AUTO_INCREMENT NOT NULL,
        user_name VARCHAR(128) NOT NULL,
        user_num_posts INT UNSIGNED,
        user_registered_at DATE,
        user_upvote_ratio FLOAT,
        CONSTRAINT users_PK PRIMARY KEY (user_ID)
        ) DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;"""
        cur.execute(q)
        connection.commit()
        
        q="""CREATE TABLE posts(
        post_ID INT AUTO_INCREMENT NOT NULL,
        author_ID INT NOT NULL,
        subreddit_ID INT NOT NULL,
        posted_at DATE NOT NULL,
        num_comments INT UNSIGNED NOT NULL,
        score INT UNSIGNED NOT NULL,
        selftext TEXT NOT NULL,
        title TEXT NOT NULL,
        total_awards INT UNSIGNED NOT NULL,
        upvote_ratio FLOAT NOT NULL,
        CONSTRAINT posts_PK PRIMARY KEY (post_ID),
        CONSTRAINT authors_FK FOREIGN KEY (author_ID) REFERENCES users(user_ID) ON DELETE CASCADE,
        CONSTRAINT subreddits_FK FOREIGN KEY (subreddit_ID) REFERENCES subreddits(subr_ID) ON DELETE CASCADE
        ) DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;"""
        cur.execute(q)
        connection.commit()
# I used the TEXT data type for the post title and selftext because in some posts the combined length of the
# title and selftext exceeded the row size limit of 65535 bytes, so it was not possible to balance the sizes
# of VARCHAR fields to include all the data. To prevent data loss, the TEXT data type was used instead. It
# avoids this problem by storing the data from these fields separately, with each TEXT field contributing
# only 9 to 12 bytes towards the row size limit.

# Missing data in the selftext of posts was dealt with by setting the selftext value to an empty string if NULL. I
# chose this approach rather than inserting NULLs into the database because I wished to concatenate the title and
# selftext of posts later in order to perform the classification experiment, and concatenating with a NULL value
# returns NULL. For the same reason, the selftext and title columns were set NOT NULL to prevent a NULL title or
# selftext being inserted into the database in the future.
# [REF: https://www.w3schools.com/mysql/func_mysql_concat.asp Accessed on: 05/05/2021] 
# Furthermore, this approach allowed the posts to be inserted into the database in one cur.executemany call rather 
# than having to distinguish between those that had NULL selftext and those that did not.
        
        q="""CREATE TABLE favourites (
        subr_ID INT NOT NULL,
        user_ID INT NOT NULL,
        CONSTRAINT favourites_PK PRIMARY KEY (subr_ID, user_ID),
        CONSTRAINT favourited_subreddits_FK FOREIGN KEY (subr_ID) REFERENCES subreddits(subr_ID) ON DELETE CASCADE,
        CONSTRAINT favouriting_user_FK FOREIGN KEY (user_ID) REFERENCES users(user_ID) ON DELETE CASCADE
        ) DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci;"""
        cur.execute(q)
        connection.commit()
        
        
        print('success')
        
finally:
    # Deliberately leave the connection open: the cell that populates the tables reuses it.
    pass



success
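
The comment in the cell above on NULL handling can be checked with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, an assumption made only so that the example runs without the MySQL server. SQLite's `||` operator propagates NULL in the same way that MySQL's `CONCAT` does. The table `demo_posts` and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_posts (title TEXT NOT NULL, selftext TEXT)")
conn.executemany(
    "INSERT INTO demo_posts VALUES (?, ?)",
    [
        ("Link-only post", None),  # missing selftext stored as NULL
        ("Link-only post", ""),    # missing selftext stored as an empty string
    ],
)

# Concatenating with NULL yields NULL; concatenating with '' keeps the title.
rows = conn.execute(
    "SELECT title || ' ' || selftext FROM demo_posts ORDER BY rowid"
).fetchall()
combined = [row[0] for row in rows]
conn.close()
print(combined)
```

The first row loses its title as soon as it is combined with a NULL selftext, which is the situation the empty-string substitution is meant to avoid.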


### Python logic for reading in the data [2 marks]

In [4]:
import csv

seen_users={}
seen_subreddits={}
authors_list=[]
subreddits_list=[]
favourites_list=[]
user_ID=1
subreddit_ID=1
posts_list=[]
favourites_dictionary={}

i=1

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open(data_path, 'r', newline='') as f:
    
    csv_reader=csv.DictReader(f, delimiter=',', quotechar='"')
    for line_dictionary in csv_reader:

        if line_dictionary['subr_faved_by']=='[]':
            line_dictionary['subr_faved_by']=[]
        else:
            line_dictionary['subr_faved_by']=line_dictionary['subr_faved_by'][2:-2].split('\', \'')
        
        author=line_dictionary["author"]

        if author not in seen_users:

            user_num_posts=line_dictionary["user_num_posts"]
            user_registered_at=line_dictionary["user_registered_at"]
            user_upvote_ratio=line_dictionary["user_upvote_ratio"]

            author_details=[user_ID, author, user_num_posts, user_registered_at, user_upvote_ratio]
            authors_list.append(author_details)

            seen_users[author]=user_ID
            user_ID+=1

        subreddit=line_dictionary["subreddit"]

        if subreddit not in seen_subreddits:

            subr_created_at=line_dictionary["subr_created_at"]
            subr_description=line_dictionary["subr_description"]
            subr_numb_members=line_dictionary["subr_numb_members"]
            subr_numb_posts=line_dictionary["subr_numb_posts"]

            subreddit_details=[subreddit_ID, subreddit, subr_created_at, subr_description, subr_numb_members, subr_numb_posts]
            subreddits_list.append(subreddit_details)
            
            favourited_users_list=line_dictionary["subr_faved_by"]
            favourites_dictionary[subreddit_ID]=favourited_users_list

            seen_subreddits[subreddit]=subreddit_ID
            subreddit_ID+=1
            


        this_author_ID=seen_users[author]
        this_subreddit_ID=seen_subreddits[subreddit]
        posted_at=line_dictionary["posted_at"]
        num_comments=line_dictionary["num_comments"]
        score=line_dictionary["score"]
        if line_dictionary["selftext"]=="NULL":
            selftext=''
        else:
            selftext=line_dictionary["selftext"]
        title=line_dictionary["title"]
        total_awards_received=line_dictionary["total_awards_received"]
        upvote_ratio=line_dictionary["upvote_ratio"]

        posts_details=[i, this_author_ID, this_subreddit_ID, posted_at, num_comments, score, selftext, title, total_awards_received, upvote_ratio]
        posts_list.append(posts_details)
        
        i+=1
    
    for subreddit_ID in favourites_dictionary:
        for user in favourites_dictionary[subreddit_ID]:
            favouriting_user_ID=seen_users[user]
            favourites_details=[subreddit_ID, favouriting_user_ID]
            favourites_list.append(favourites_details) 
        
    print("Success")



Success


The only potentially multi-valued column in this dataset was 'subr_faved_by', which contains a stringified list of users. I dealt with this by first extracting the list of strings from the stringified list, removing the square brackets and outer quotes and splitting with '\', \'' as the delimiter. On the first pass through the data, these lists were saved for each unique subreddit in the variable favourites_dictionary, with the subreddit_ID as the key. Then, on a second pass over this data, the user_ID was obtained for each element of each subreddit's favourites list, and each pair \[subreddit_ID, user_ID\] was appended to the list of favourites for insertion into a dedicated favourites linking table in the database.
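
As a minimal sketch of the parsing step described above, the slicing and splitting can be wrapped in a helper and tried on a sample value. The function name `parse_faved_by` and the sample usernames are hypothetical. The comparison with `ast.literal_eval` is an aside, not the approach used in this notebook.

```python
import ast

def parse_faved_by(raw):
    """Split a stringified list such as "['a', 'b']" into a list of names."""
    if raw == '[]':
        return []
    # Drop the leading "['" and trailing "']", then split on the "', '" separator.
    return raw[2:-2].split("', '")

sample = "['alice', 'bob_99', 'carol']"  # hypothetical cell value
parsed = parse_faved_by(sample)
print(parsed)

# ast.literal_eval from the standard library parses the same literal.
assert parsed == ast.literal_eval(sample)
```

The slicing approach relies on every name being wrapped in single quotes and separated by a comma and one space. `ast.literal_eval` would be a more tolerant alternative if that formatting could not be guaranteed.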

It was initially thought that some users might appear in the dataset as having favourited a subreddit without ever having posted, in which case their full details would not be available. However, running

    if user not in seen_users:
        seen_users[user]=user_ID
        user_details=[user_ID, user]
        authors_list.append(user_details)

was found not to increase the length of authors_list, so this step was omitted. However, the possibility of such users appearing in any future dataset added to the database was left open by allowing all fields in the users table apart from 'user_ID' and 'user_name' to take a NULL value.
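
If a future dataset did contain users who favourited a subreddit without posting, the omitted step could look roughly like the sketch below. The helper `add_missing_users` and the sample data are hypothetical. Unknown details are filled with `None`, which pymysql sends to MySQL as NULL, so each row still matches the five columns of the users table.

```python
def add_missing_users(favourites_dictionary, seen_users, authors_list, next_user_ID):
    """Register users who only appear in a favourites list.

    Returns the next unused user ID.
    """
    for users in favourites_dictionary.values():
        for user in users:
            if user not in seen_users:
                seen_users[user] = next_user_ID
                # Same column order as the users table; unknown fields are None.
                authors_list.append([next_user_ID, user, None, None, None])
                next_user_ID += 1
    return next_user_ID

# Hypothetical data: 'dave' favourited subreddit 1 but never posted.
seen_users = {'alice': 1}
authors_list = [[1, 'alice', '12', '2019-03-01', '0.9']]
next_ID = add_missing_users({1: ['alice', 'dave']}, seen_users, authors_list, 2)
print(authors_list[-1], next_ID)
```

This would need to run before the second pass that builds the favourites list, so that every name in a favourites list already has a user_ID.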


### SQL and Python code for populating the tables [3 marks]

In [5]:
try:
    with connection.cursor() as cur:
        q="""INSERT INTO subreddits VALUES (%s, %s, %s, %s, %s, %s);"""
        cur.executemany(q, subreddits_list)
        connection.commit()
        
        q="""INSERT INTO users VALUES (%s, %s, %s, %s, %s);"""
        cur.executemany(q, authors_list)
        connection.commit()
    
        q="""INSERT INTO posts VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s);"""
        cur.executemany(q, posts_list)
        connection.commit()
    
        q="""INSERT INTO favourites VALUES (%s, %s);"""
        cur.executemany(q, favourites_list)
        connection.commit()
    
finally:
    connection.close()
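
The insert order above puts subreddits and users first because posts and favourites reference them through foreign keys. As a hedged sketch, the lists could be checked in plain Python before insertion so that any dangling reference is found before the database rejects it. The helper `check_references` and the toy rows are hypothetical and not part of the submitted solution.

```python
def check_references(posts_list, favourites_list, authors_list, subreddits_list):
    """Return rows whose foreign keys point at IDs missing from the parent lists."""
    user_IDs = {row[0] for row in authors_list}
    subreddit_IDs = {row[0] for row in subreddits_list}
    # Posts store author_ID at index 1 and subreddit_ID at index 2.
    bad_posts = [p for p in posts_list
                 if p[1] not in user_IDs or p[2] not in subreddit_IDs]
    # Favourites store subr_ID at index 0 and user_ID at index 1.
    bad_favourites = [f for f in favourites_list
                      if f[0] not in subreddit_IDs or f[1] not in user_IDs]
    return bad_posts, bad_favourites

# Toy rows in the same column order as the tables.
authors_list = [[1, 'alice', '12', '2019-03-01', '0.9']]
subreddits_list = [[1, 'demo_sub', '2020-01-01', 'A demo', '10', '1']]
posts_list = [
    [1, 1, 1, '2020-05-01', '0', '3', '', 'Valid post', '0', '1.0'],
    [2, 7, 1, '2020-05-02', '0', '1', '', 'Unknown author', '0', '1.0'],
]
favourites_list = [[1, 1], [1, 9]]

bad_posts, bad_favourites = check_references(
    posts_list, favourites_list, authors_list, subreddits_list)
print(len(bad_posts), len(bad_favourites))
```

With the lists built by the reading cell, both results would be expected to be empty, since every ID is taken from `seen_users` or `seen_subreddits`.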