# Talk Business
## General Assembly | Data Science Intensive | Project 3

## Introduction

### Summary
This project aims to identify the topics that small business owners and startup owners discuss on Reddit, along with their similarities and differences. The results are intended for a financial institution that is starting a new blog to attract small companies. The research investigates how the interests of these two groups differ, and whether it makes sense to address them together or separately.

### Source
Data are pulled from the **r/startups** and **r/smallbusiness** subreddits. Both are among the most active in the Finance & Business section on Reddit.

### In this Notebook
In this notebook I use a function to collect data from both subreddits through the PushShift API, then concatenate the resulting dataframes.

At the end I export the dataframe as `corpus.csv`, to be processed in the next notebook.



### Notebook Index

- [01 | Data Collection](01_data_collection.ipynb)
- [02 | EDA & Cleaning](02_eda.ipynb)
- [03 | Model](03_model.ipynb)
- [04 | Sentiment Analysis](04_sentiment_analysis.ipynb)

## 1. Imports & Preliminary Checks

For this notebook I import `pandas` and `numpy`, used to handle and export the dataframes, together with `time` and `requests`, used in the function below.

In [21]:
import pandas as pd
import numpy as np

import requests
import time

First I check that the API is reachable and that I am connecting properly for both subreddits.

In [23]:
res_stup = requests.get("https://api.pushshift.io/reddit/search/submission?subreddit=startups")
res_smb = requests.get("https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness")

In [24]:
res_stup.status_code, res_smb.status_code

(200, 200)

Both requests return status code 200 (OK), so I proceed.
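
The same check can be wrapped in a small helper so that a failed or slow request raises an error instead of being inspected by hand. This is a minimal sketch, not part of the original notebook: `build_url` and `check_endpoint` are hypothetical names, and the 10-second timeout is an assumed value.

```python
import requests

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_url(subreddit):
    """Return the submission-search URL for one subreddit."""
    return f"{BASE_URL}?subreddit={subreddit}"


def check_endpoint(subreddit, timeout=10):
    """Request the endpoint and return its status code; raise on HTTP errors."""
    res = requests.get(build_url(subreddit), timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.status_code
```

Calling `check_endpoint("startups")` and `check_endpoint("smallbusiness")` would then replace the manual comparison of the two status codes.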

## 2. PushShift

After inspecting the JSON response, I decide to import the following features: `title`, `selftext`, `subreddit`, `created_utc`, `author`, `num_comments`, `score`, `is_self`, and I save them in a list called `subfields`.

In [39]:
subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']

The following function accepts the subreddit, the interval in days, and the number of queries to perform, and returns a dataframe of the posts, which includes the subfields selected above among its columns.

Then I run the function for both subreddits.

In [40]:
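def import_posts(subreddit, kind='submission', days=20, number=100):
    """Query PushShift `number` times, stepping `after` by `days`, and return one dataframe."""

    base_url = f"https://api.pushshift.io/reddit/search/{kind}"  # the API endpoint

    # construct the full url; size=500 is the number of posts requested per query
    stem = f"{base_url}?subreddit={subreddit}&size=500"

    posts = []

    for i in range(1, number + 1):
        url = f"{stem}&after={days * i}d"
        print("Querying from: " + url)
        res = requests.get(url)
        # stop immediately if the request did not succeed
        if res.status_code != 200:
            raise RuntimeError(f"Request failed with status {res.status_code}: {url}")
        data = res.json()['data']  # list of post dictionaries
        posts.append(pd.DataFrame(data))
        time.sleep(2)  # pause between requests

    # stack the per-query dataframes without sorting the columns
    corpus = pd.concat(posts, sort=False)
    return corpus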
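
To see which time windows the loop requests without calling the API, the URL construction can be isolated. The sketch below mirrors the f-strings used in `import_posts`; `query_urls` is a hypothetical helper introduced here for illustration only.

```python
def query_urls(subreddit, kind="submission", days=20, number=100, size=500):
    """Build the URLs that a loop like the one in import_posts would request."""
    base_url = f"https://api.pushshift.io/reddit/search/{kind}"
    stem = f"{base_url}?subreddit={subreddit}&size={size}"
    # One URL per step: after=20d, after=40d, ... up to after=(days * number)d.
    return [f"{stem}&after={days * i}d" for i in range(1, number + 1)]


urls = query_urls("startups", number=3)
for url in urls:
    print(url)
```

With the default arguments the list has 100 entries and the last one ends in `after=2000d`, that is, 100 steps of 20 days each.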

In [41]:
startup_df = import_posts("startups")

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=20d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=40d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=80d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=100d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=140d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=160d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=180d
...

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1600d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1620d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1640d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1660d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1680d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1700d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1720d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1740d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=startups&size=500&after=1760d
...

In [42]:
smbiz_df = import_posts("smallbusiness")

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=20d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=40d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=60d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=80d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=100d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=120d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=140d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=160d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=180d
...

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1520d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1540d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1560d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1580d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1600d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1620d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1640d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1660d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=smallbusiness&size=500&after=1680d
...

## 3. Dataframes

Then I keep only the columns listed in `subfields`.

In [43]:
startup_df = startup_df[subfields]

In [44]:
smbiz_df = smbiz_df[subfields]

In [45]:
startup_df.shape, smbiz_df.shape

((9997, 8), (9997, 8))

My two dataframes have the exact same size.
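
Selecting with `df[subfields]` raises a `KeyError` if any listed column is absent from the API response. Where that is a concern, `DataFrame.reindex` is one alternative, since it adds absent columns filled with `NaN` instead of failing. A minimal sketch on made-up rows (the `raw` dataframe and its values are illustrative, not real data):

```python
import pandas as pd

subfields = ['title', 'selftext', 'subreddit', 'created_utc',
             'author', 'num_comments', 'score', 'is_self']

# Hypothetical response rows: an extra column ('id') and no 'is_self' column.
raw = pd.DataFrame({
    'id': ['a1', 'b2'],
    'title': ['First post', 'Second post'],
    'selftext': ['Body text', ''],
    'subreddit': ['startups', 'startups'],
    'created_utc': [1610070442, 1610070624],
    'author': ['user_a', 'user_b'],
    'num_comments': [5, 2],
    'score': [2, 1],
})

# reindex keeps only the requested columns, in order, and adds missing ones as NaN.
selected = raw.reindex(columns=subfields)

print(selected.columns.tolist())
print(selected['is_self'].isna().all())
```

Whether a silent `NaN` column is preferable to an early error depends on the use case; the plain selection used in this notebook fails loudly, which is often what you want during collection.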

As a basic check, and to avoid saving an unnecessarily large file, I remove duplicates and print the head of each dataframe to make sure all previous steps worked properly.

In [46]:
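# reassign instead of using inplace=True on a column selection, which can trigger SettingWithCopyWarning
startup_df = startup_df.drop_duplicates()
smbiz_df = smbiz_df.drop_duplicates()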
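
By default `drop_duplicates` removes only rows that match on every column, so a repost with a different timestamp is kept. If a stricter rule were wanted, the `subset` argument narrows the comparison. A small sketch on made-up rows (the data and the choice of `title` and `author` as the key are assumptions for illustration):

```python
import pandas as pd

posts = pd.DataFrame({
    'title': ['Same title', 'Same title', 'Same title', 'Other title'],
    'author': ['user_a', 'user_a', 'user_a', 'user_b'],
    'created_utc': [100, 100, 130, 200],
})

# Default: rows must match on all columns to count as duplicates.
exact = posts.drop_duplicates()

# Stricter: rows sharing title and author count as duplicates, whatever the timestamp.
by_key = posts.drop_duplicates(subset=['title', 'author'])

print(len(posts), len(exact), len(by_key))
```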

In [47]:
startup_df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self |
|---|---|---|---|---|---|---|---|---|
| 0 | I'll test your product for 10 mins or give you... | [removed] | startups | 1610070442 | Pool_of_Death | 5 | 2 | True |
| 1 | Can I build you an API for free? | [removed] | startups | 1610070624 | guru223 | 2 | 1 | True |
| 2 | Can I build you an API? | [removed] | startups | 1610070654 | guru223 | 2 | 1 | True |
| 3 | Better School Meals for Ghana! | | startups | 1610070999 | nutriprideafrica | 0 | 1 | False |
| 4 | I am sharing an online discussion event with s... | [removed] | startups | 1610075207 | sapfunnyxd | 3 | 1 | True |


In [48]:
smbiz_df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self |
|---|---|---|---|---|---|---|---|---|
| 0 | Gusto problems? | | smallbusiness | 1610069013 | [deleted] | 0 | 1 | True |
| 1 | what would be the best way of calling potentia... | [removed] | smallbusiness | 1610069951 | wilsonckao | 2 | 1 | True |
| 2 | GURUS VS BOOKS | [removed] | smallbusiness | 1610069955 | Alpha12x | 0 | 1 | True |
| 3 | Is there a way to spread out a windfall income... | If my business suddenly brings in a thousand t... | smallbusiness | 1610070146 | if_yes_else_no | 5 | 1 | True |
| 4 | EIDL/PPP Credit Denials A Thing? | I’m just today hearing from my Friend that bot... | smallbusiness | 1610070921 | WildFireBrand | 3 | 1 | True |


In [49]:
df = pd.concat([smbiz_df, startup_df])

In [50]:
df.shape

(19993, 8)
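
A quick way to confirm that both sources survived the concatenation is to count rows per `subreddit`. The sketch below uses two tiny made-up dataframes; `ignore_index=True` is an optional choice, not used in this notebook, that gives the result a fresh 0..n-1 index instead of repeating each source's index.

```python
import pandas as pd

# Made-up stand-ins for the two per-subreddit dataframes.
smb = pd.DataFrame({'title': ['a', 'b'], 'subreddit': ['smallbusiness'] * 2})
stup = pd.DataFrame({'title': ['c', 'd', 'e'], 'subreddit': ['startups'] * 3})

combined = pd.concat([smb, stup], ignore_index=True)

print(combined.shape)
print(combined['subreddit'].value_counts().to_dict())
```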

With the two dataframes concatenated, I check the head of the combined dataframe before exporting it as csv.

In [51]:
df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self |
|---|---|---|---|---|---|---|---|---|
| 0 | Gusto problems? | | smallbusiness | 1610069013 | [deleted] | 0 | 1 | True |
| 1 | what would be the best way of calling potentia... | [removed] | smallbusiness | 1610069951 | wilsonckao | 2 | 1 | True |
| 2 | GURUS VS BOOKS | [removed] | smallbusiness | 1610069955 | Alpha12x | 0 | 1 | True |
| 3 | Is there a way to spread out a windfall income... | If my business suddenly brings in a thousand t... | smallbusiness | 1610070146 | if_yes_else_no | 5 | 1 | True |
| 4 | EIDL/PPP Credit Denials A Thing? | I’m just today hearing from my Friend that bot... | smallbusiness | 1610070921 | WildFireBrand | 3 | 1 | True |


Happy with the number of posts I imported, I proceed to export the dataframe as a `.csv` file.

In [52]:
df.to_csv('../data/corpus.csv', index=False)
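
Before moving on to the next notebook, the export can be read back to confirm that the shape and column names survive the round trip. A minimal sketch with a made-up dataframe and a temporary directory standing in for `../data/` (both are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Made-up rows; the second post has an empty body.
sample = pd.DataFrame({
    'title': ['First post', 'Second post'],
    'selftext': ['Some body text', ''],
    'subreddit': ['smallbusiness', 'startups'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'corpus.csv')
    # index=False avoids an extra 'Unnamed: 0' column when the file is read back
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape == sample.shape)
print(reloaded.columns.tolist() == sample.columns.tolist())
print(reloaded['selftext'].isna().sum())
```

Note that the empty string in `selftext` comes back as `NaN` with the default `read_csv` settings, which is worth keeping in mind when cleaning the text.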