## Case Study 1
#### Note: This should be completed individually.

Assume you are new to the data science field, and you want to find out what real practitioners and soon-to-be data scientists are concerned about. One place where you may find such information is X (formerly known as Twitter). However, X users often use their real identities and may have reservations about sharing all their opinions publicly. Another place where such information may be found is the datascience subreddit on Reddit.com (https://www.reddit.com/r/datascience/). Users there are assumed to be anonymous, so they are more likely to share their opinions without reservations. To find out common concerns among the datascience subreddit users, it might be a good idea to collect the top 100 posts in the subreddit from the year 2023, along with the top 3 comments on each of those posts. In this case study, we will do exactly that. Specific details can be found in the next few cells.

This data can be used for many different projects. However, we are only going to focus on the "data gathering" part. We will also do some cleaning.

**Note**: This case study contributes 10% to your overall grade.

## Step 1: 
###  15 points


**Description:** 

Learn about the **PRAW** package for Python and how you can use it to load Reddit posts, comments, etc. in a Jupyter Notebook. Do a Google search; you might find tutorials, and it is okay to use them. You may need secret keys for this part, which requires a Reddit account. You can use a throwaway account for this purpose. Write your code in the cell below. Any code you write to retrieve data from Reddit can go there.

Here is a link to the PRAW documentation: https://praw.readthedocs.io/en/stable/#getting-started

**Grading criteria:** 

The code for this step must be correct. Otherwise, the next steps cannot be completed, and in that case they will not be graded. If you receive a praw object from the data science subreddit, you will get the full 15 points. Other methods may be considered, but are not encouraged.

In [1]:
# your code for step 1 goes here

#pip install praw
#pip install --upgrade https://github.com/praw-dev/praw/archive/master.zip

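import os

import praw

# Credentials are read from environment variables so that they are never
# stored in the notebook. The variable names below are placeholders; set
# them to the values for your own Reddit app and account.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     password=os.environ['REDDIT_PASSWORD'],
                     user_agent='python_praw',
                     username=os.environ['REDDIT_USERNAME'])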

subreddit=reddit.subreddit('datascience')
top=subreddit.top(limit=5)
print(top)



<praw.models.listing.generator.ListingGenerator object at 0x0000024D759BCB20>


## Step 2: 
### 10 + 20 + 10 + 15 + 5 + 5 = 65 points

**Description**:
Once you have the mechanism in place to retrieve data from Reddit, your next step is to determine which parts of the data are necessary. For this case study, collect only the top posts from the year 2023. Also check whether the score of each post is above 50. A post with a lower score might not have been an important one, so do not consider those posts.

You may also observe that posts with memes or jokes sometimes get a lot of 'upvotes' and therefore have high scores, even though they may not be useful for this case study. To address this problem, you will simply get rid of any post that has fewer than 5 words in the title.
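The two filters described above can be sketched as a small predicate. This is only an illustration, not the required solution: the function name `keep_post`, its parameters, and the sample titles are made up for the example. Note that the text above drops titles with fewer than 5 words while the grading criteria ask for more than 5, so the threshold is left as a parameter.

```python
def keep_post(title: str, score: int, min_score: int = 50, min_title_words: int = 5) -> bool:
    """Return True when a post passes the score and title-length filters."""
    # str.split() with no argument splits on any run of whitespace,
    # so double spaces do not produce empty "words".
    word_count = len(title.split())
    return score > min_score and word_count >= min_title_words


# Made-up (title, score) pairs, used only to show the filter in action.
sample = [
    ("lol", 900),
    ("How do you validate a churn model in production?", 120),
    ("Advice for a new grad entering data science", 12),
]
kept = [title for title, score in sample if keep_post(title, score)]
print(kept)
```

Pass `min_title_words=6` to apply the stricter "more than 5 words" reading.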

You will also notice that praw returns time as an integer (seconds since the Unix epoch). It is inconvenient to read time like that, so convert the integer time to a human-readable date. You do not need to include hours, minutes, or seconds. Just year, month, and day is enough.
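One possible way to do the conversion uses only the standard library. This is a sketch under the assumption that the integer is a Unix timestamp in UTC; the helper names `utc_date_string` and `is_from_year` are hypothetical and not part of praw.

```python
from datetime import datetime, timezone


def utc_date_string(created_utc: float) -> str:
    """Convert a Unix timestamp (seconds, UTC) to a 'YYYY-MM-DD' string."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%d")


def is_from_year(created_utc: float, year: int = 2023) -> bool:
    """Return True when the timestamp falls inside the given calendar year (UTC)."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year == year


# 1672531200 is midnight on 1 January 2023, UTC.
print(utc_date_string(1672531200))
```

Passing an explicit time zone avoids results that depend on the local clock of the machine running the notebook.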

**Grading Criteria:**
* posts are only from the year 2023: 10 points
* the integer time format converted into year-month-day: 20 points
* only posts with scores more than 50 were considered: 10 points
* only post titles with more than 5 words were kept: 15 points
* minimum 100 posts were collected: 5 points
* three comments collected for each post: 5 points

Note: All six grading criteria can be satisfied by writing one line or many lines of code. It does not matter. As long as your code satisfies the six criteria (in one line or many lines), you will get full points. Otherwise, you will get partial credit.

Also note: In case the API does not allow you to collect 100 posts, you may collect fewer (for example, 80 or 90). In that case, please copy and paste the error message in a new cell below.

In [2]:
# your code for step 2 goes here
# create as many cells as necessary
import pandas as pd
import datetime

subreddit = reddit.subreddit("datascience")

# Fetch the top posts of all time; the loop below filters them by date
posts = subreddit.top(time_filter="all", limit=None)

#dictionary containing various post attributes
posts_dict = {"Post title": [], "Post Text": [],"Date":[],
              "ID": [], "Post score": [], "Upvote Ratio": [],
              "Total Comments": [], "Post URL": [],
              "Original Content": [], "Top comment 1":[], 
              "Top comment 2":[], "Top comment 3":[] }

# 2023 boundaries as UTC timestamps (post.created_utc is in UTC)
start_date = datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc).timestamp()
end_date = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc).timestamp()


for post in posts:
    # Creation time of each post (seconds since the epoch, UTC)
    date_creation = post.created_utc
    # Keep only 2023 posts with a score above 50, a title of more than 5 words, and at least 3 comments
    if start_date <= date_creation < end_date and post.score>50 and len(post.title.split())>5 and post.num_comments>=3:
        # Title of each post
        posts_dict["Post title"].append(post.title)
     
        # Text inside a post
        posts_dict["Post Text"].append(post.selftext)
     
        # Unique ID of each post
        posts_dict["ID"].append(post.id)
     
        # The score of a post
        posts_dict["Post score"].append(post.score)
        
        posts_dict["Upvote Ratio"].append(post.upvote_ratio)
     
        # Total number of comments inside the post
        posts_dict["Total Comments"].append(post.num_comments)
         
        # Date the post was Created
        date_and_time=pd.to_datetime(post.created_utc, unit='s')
        dt_string=str(date_and_time)
        date=dt_string[:10]
        posts_dict["Date"].append(date)
        
        # URL of each post
        posts_dict["Post URL"].append(post.url)
        
        # Whether the post is flagged as original content
        posts_dict["Original Content"].append(post.is_original_content)
        
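        # Getting the first 3 top-level comments (in Reddit's default sort order)
        # replace_more(limit=0) drops "load more comments" placeholders, which have no body
        post.comments.replace_more(limit=0)
        top_comments = [comment.body for comment in post.comments[:3]]
        # Pad with None if the post has fewer than 3 top-level comments
        top_comments += [None] * (3 - len(top_comments))
        
        posts_dict["Top comment 1"].append(top_comments[0])
        posts_dict["Top comment 2"].append(top_comments[1])
        posts_dict["Top comment 3"].append(top_comments[2])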
    

# Saving the data in a pandas dataframe
all_posts = pd.DataFrame(posts_dict)
all_posts.head(20)





Unnamed: 0,Post title,Post Text,Date,ID,Post score,Upvote Ratio,Total Comments,Post URL,Original Content,Top comment 1,Top comment 2,Top comment 3
0,"As a hiring manager - this, this right here",,2023-01-27,10mmm38,2624,0.97,135,https://i.redd.it/fk95v2ghilea1.png,False,He also has a PhD in mathematics so I'm sure t...,I’m assuming he had a strong mathematical/stat...,How seriously are personal projects taken? I'm...
1,Pretty Accurate Chart to Clear Up Job Title Am...,,2023-04-26,12zwc24,1971,0.94,199,https://i.redd.it/1t5iwk9ymbwa1.jpg,False,"> _Yeah, well, that's just, like, your opinion...",\nChart : Ambiguous\nJob Title Ambiguities : e...,"I've done all of those things, in varying amou..."
2,"300,000+ Tech jobs have been vanished in the l...",,2023-01-20,10h4zfl,1379,0.92,182,https://i.redd.it/x9hvdw9rdada1.jpg,False,Based on your chart it's 200k+ in the past six...,Someone else posted something similar but inst...,If my data viz was this bad I would deserve it
3,Which programming language is required for a...,,2023-02-27,11d4uys,1235,0.84,238,https://i.redd.it/bdmbfzxtvpka1.jpg,False,It's a shame you didn't copy the FAIRLY useful...,"It's never matlab, why did my PhD supervisor m...",Php 🤣🤣🤣 has the author learned php and trying ...
4,Everyone here seems focused on advanced modell...,I’ve been browsing this sub for over 5 years. ...,2023-03-18,11uzhqa,1195,0.95,190,https://www.reddit.com/r/datascience/comments/...,False,I've lost track of how many DS contracts are: ...,Data analytics.. that’s data analytics,"True, but I would argue in general, if you're ..."
5,I investigated the Underground Economy of Glas...,Online company reviews are high stakes.\n\nTop...,2023-05-15,13ilm03,1152,0.99,63,https://www.reddit.com/r/datascience/comments/...,False,Very thorough analysis OP. I wonder if there’s...,Incredibly thorough analysis. And that's why I...,This is a great investigation. Wow.
6,"When Pandas.read_csv ""helpfully"" guesses the d...",,2023-02-27,11ddeft,1121,0.97,23,https://i.redd.it/tppr6p77tqka1.png,False,"If you open it in Excel first, he'd be Agent J...",The further I get into ML and data engineering...,That dummy 1 is obviously a float not an int.
7,The true reason I chose to be a DS..,,2023-01-16,10de0j4,1033,0.91,27,https://i.redd.it/2te69la63gca1.jpg,False,"I'm not a DS, I'm a fleshy Reinforcement Algor...",What the hell man I just got here,I did not need to be called out like this today
8,Very simple guys. This is the way to go.,,2023-03-23,1200b4s,1033,0.85,253,https://i.redd.it/aspuqjwxhkpa1.png,False,I love when people who have a relevant degree ...,"I mean that would be great, if it was any less...",It’s very r/restofthefuckingowl kind of advice...
9,Changing my feminine first name to a masculine...,Just a heads up to any other women that this c...,2023-01-04,1032pgs,974,0.87,226,https://www.reddit.com/r/datascience/comments/...,False,I have a female cheer leader name. I once had ...,I had the same experience with legally changin...,What’s crazy is I have the opposite problem. I...


In [3]:

#Checking number of posts collected
all_posts.shape

(121, 12)

## Step 3: 
### 10 points

Save the data to your local disk. You may have used lists or similar data structures for the initial processing. Convert that data structure into a Pandas dataframe. Save the dataframe as a .csv file on your local disk.

Here are the column details:

Column 1: Date

Column 2: Post score

Column 3: Post title

Column 4: Top comment 1

Column 5: Top comment 2

Column 6: Top comment 3

When you create the .csv file, it should have 101 rows (including column names) and 6 columns.

**Grading criteria:**
If your code produces a .csv file on the local disk in the same folder as the Jupyter Notebook file, you get full points. Otherwise, no points.
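To confirm that the saved file has the expected 101 rows and 6 columns, one option is to read it back with the standard `csv` module. This is only a sketch; the helper name `csv_shape` and the file name in the usage comment are placeholders.

```python
import csv


def csv_shape(path: str) -> tuple:
    """Return (rows, columns) for a CSV file, counting the header row."""
    # csv.reader treats a quoted field that contains newlines as part of
    # one record, which a plain line count of the file would not.
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    n_columns = len(rows[0]) if rows else 0
    return len(rows), n_columns


# Example usage (replace the file name with your own):
# rows, columns = csv_shape("firstname_lastname_casestudy_1.csv")
# print(rows, columns)
```

Reddit comments often contain line breaks, so counting lines in a text editor can overstate the number of rows.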

In [4]:
# your code for step 3 goes here
# create as many cells as necessary

#creating appropriate dataframe having required columns 
final_df=all_posts.filter(items=['Date','Post score','Post title','Top comment 1', 'Top comment 2', 'Top comment 3']).head(100)
final_df

#converting dataframe to a csv file
final_df.to_csv('isha_jain_casestudy1.csv', encoding='utf-8',index=False)




## Step 4:
### 10 points
#### Presentation slides:
   
Create presentation slides for this case study. The slides should provide an overview of the problem you tried to solve, the methods you used (don't put actual code in the slides), and any new insights you discovered from the data you collected. You may include actual post titles or comments that you found insightful or interesting. The number of slides should be around 6-7 (no hard limit). Three of you will be randomly chosen and asked to present your work in class. You should be prepared to present your work for 5 minutes.


**Notes on grading**: 5 points will be deducted if you are not prepared to present on the day of submission.

### What to submit:

All files should be named in the following format:

firstname_lastname_casestudy_1.pdf

firstname_lastname_casestudy_1.ipynb

firstname_lastname_casestudy_1.csv

etc.


Put the Jupyter Notebook file and the .csv file in a folder. Then convert your presentation slides to a PDF file and put it in the same folder. Zip the folder. After zipping, it should have the extension .zip. The name of the .zip file should be firstname_lastname_casestudy_1.zip. Upload the .zip file to Canvas.
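If you prefer to create the archive from Python rather than from your file manager, a sketch along these lines should work. It assumes the folder is already named in the required format; the function name `zip_submission` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_submission(folder: str) -> str:
    """Zip `folder` into `<folder>.zip` next to it and return the archive path."""
    folder_path = Path(folder).resolve()
    if not folder_path.is_dir():
        raise NotADirectoryError(f"{folder_path} is not a folder")
    # root_dir/base_dir keep the folder itself as the top-level entry in the
    # archive, so unzipping recreates firstname_lastname_casestudy_1/.
    return shutil.make_archive(
        base_name=str(folder_path),
        format="zip",
        root_dir=str(folder_path.parent),
        base_dir=folder_path.name,
    )


# Example usage (replace with your own folder name):
# print(zip_submission("firstname_lastname_casestudy_1"))
```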

