# Table of Contents

* Introduction to Data Scraping
* Data Scraping from Twitter
* Data Scraping from Reddit
* Acknowledgements

# Introduction to Data Scraping

Data is available everywhere. To run Data Science experiments, we often need to extract it from a variety of sources.

The code in this notebook extracts data from two popular websites, Twitter and Reddit. For Twitter, it pulls tweets that match the user's requirements and converts them into a Pandas dataframe. For Reddit, it scrapes a subreddit's recent posts and comments and converts them into a Pandas dataframe.

### Method 3 for Twitter uses snscrape, which requires Python 3.8 or higher. Kaggle notebooks run on Python 3.7 and I could not find a reliable workaround, so the Method 3 code is commented out to avoid errors on execution.

### To run the Twitter Method 3 code, use a Jupyter notebook with Python 3.8 or higher.

# Data Scraping from Twitter

### You need a Twitter Developer account for Methods 1 and 2

## Method 1 - Twitter Scraper using Keywords

##### Extract all tweets based on a keyword, e.g. Covid-19, DataScience, etc.

### 1. Install Tweepy and Import Libraries

In [1]:
pip install tweepy

Collecting tweepy
  Downloading tweepy-4.6.0-py2.py3-none-any.whl (69 kB)
Collecting oauthlib<4,>=3.2.0
  Downloading oauthlib-3.2.0-py3-none-any.whl (151 kB)
Collecting requests<3,>=2.27.0
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Installing collected packages: requests, oauthlib, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.25.1
    Uninstalling requests-2.25.1:
      Successfully uninstalled requests-2.25.1
  Attempting uninstall: oauthlib
    Found existing installation: oauthlib 3.1.1
    Uninstalling oauthlib-3.1.1:
      Successfully uninstalled oauthlib-3.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency

In [2]:
import os
import tweepy as tw
import pandas as pd
from tqdm import tqdm, notebook

### 2. Twitter API Authentication

#### Pass in the CONSUMER_API_KEY and CONSUMER_API_SECRET from your Twitter Developer account. 

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

consumer_api_key = user_secrets.get_secret("CONSUMER_API_KEY")
consumer_api_secret = user_secrets.get_secret("CONSUMER_API_SECRET")

In [4]:
auth = tw.OAuthHandler(consumer_api_key, consumer_api_secret)
api = tw.API(auth, wait_on_rate_limit=True)

### 3. Tweets Query

#### 3.1 Define the Query

In the cell below, we collect up to 500 #datascience tweets (excluding retweets) posted since 1 January 2021.

In [5]:
search_words = "#datascience -filter:retweets"
date_since = "2021-01-01"
# Collect tweets
tweets = tw.Cursor(api.search_tweets,
                   q=search_words,
                   lang="en",
                   since=date_since).items(500)
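
The query string above combines a hashtag with the `-filter:retweets` operator. As a minimal sketch, the helper below assembles such strings from their parts. `build_search_query` is a hypothetical name, not part of Tweepy, and the `since:`/`until:` operators are the ones used in the snscrape examples later in this notebook. How far back a given search endpoint actually reaches is an assumption to verify against your API access level.

```python
def build_search_query(hashtag, exclude_retweets=True, since=None, until=None):
    """Assemble a Twitter search query string (hypothetical helper)."""
    # Accept the hashtag with or without the leading '#'
    parts = ["#" + hashtag.lstrip("#")]
    if exclude_retweets:
        parts.append("-filter:retweets")
    if since:
        parts.append(f"since:{since}")
    if until:
        parts.append(f"until:{until}")
    return " ".join(parts)


query = build_search_query("datascience", since="2021-01-01")
print(query)
```

The resulting string can be passed as `q=` in place of `search_words`.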

#### 3.2 Retrieve the tweets

In [6]:
tweets_copy = []
for tweet in tqdm(tweets):
    tweets_copy.append(tweet)

500it [00:12, 41.33it/s]


In [7]:
print(f"new tweets retrieved: {len(tweets_copy)}")

new tweets retrieved: 500


### 4. Populate the Dataset

#### Extract the information contained in a tweet into a Pandas dataframe

In [8]:
rows = []
for tweet in tqdm(tweets_copy):
    # Hashtags are already part of the search result
    hashtags = [hashtag["text"] for hashtag in tweet.entities.get("hashtags", [])]
    try:
        # Re-fetch each tweet in extended mode to get the untruncated text
        text = api.get_status(id=tweet.id, tweet_mode='extended').full_text
    except tw.TweepyException:
        # Fall back to the (possibly truncated) text from the search result
        text = tweet.text
    rows.append({'user_name': tweet.user.name,
                 'user_location': tweet.user.location,
                 'user_description': tweet.user.description,
                 'user_created': tweet.user.created_at,
                 'user_followers': tweet.user.followers_count,
                 'user_friends': tweet.user.friends_count,
                 'user_favourites': tweet.user.favourites_count,
                 'user_verified': tweet.user.verified,
                 'date': tweet.created_at,
                 'text': text,
                 'hashtags': hashtags if hashtags else None,
                 'source': tweet.source,
                 'is_retweet': tweet.retweeted})

# Build the dataframe once: DataFrame.append was removed in pandas 2.0
tweets_df = pd.DataFrame(rows)

100%|██████████| 500/500 [02:22<00:00,  3.50it/s]
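
To make the flattening step easier to try without API access, here is a small sketch that maps one tweet-like object to a row dictionary. `tweet_to_row` and `fake_tweet` are hypothetical stand-ins built with `SimpleNamespace`. Only the attribute names (`user.name`, `created_at`, `entities`, `retweeted`, `text`) mirror the ones used in the cell above, and only a subset of the columns is shown.

```python
from types import SimpleNamespace


def tweet_to_row(tweet, full_text=None):
    """Flatten one tweet-like object into a dict of columns (hypothetical helper)."""
    hashtags = [h["text"] for h in tweet.entities.get("hashtags", [])]
    return {
        "user_name": tweet.user.name,
        "date": tweet.created_at,
        # Prefer the extended text when it was fetched, else the short one
        "text": full_text if full_text is not None else tweet.text,
        "hashtags": hashtags or None,
        "is_retweet": tweet.retweeted,
    }


# Made-up object standing in for a Tweepy status
fake_tweet = SimpleNamespace(
    user=SimpleNamespace(name="example_user"),
    created_at="2022-02-25 11:43:18+00:00",
    text="Short text...",
    entities={"hashtags": [{"text": "datascience"}, {"text": "AI"}]},
    retweeted=False,
)

row = tweet_to_row(fake_tweet)
print(row)
```

A list of such dictionaries can be handed to `pd.DataFrame(...)` in one call.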


#### Check head of the dataframe

In [9]:
tweets_df.head()

| | user_name | user_location | user_description | user_created | user_followers | user_friends | user_favourites | user_verified | date | text | hashtags | source | is_retweet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Salynt | Washington, DC | Salynt provides data scientists and software e... | 2021-08-11 19:50:54+00:00 | 47 | 20 | 54 | False | 2022-02-25 11:43:18+00:00 | We're fundamentally changing how software deve... | [softwareengineering, datascience, AI] | Twitter Web App | False |
| 0 | Bharat | Mumbai, India | Data Enthusiast. NLP/Text Analytics. | 2011-10-17 10:07:17+00:00 | 635 | 424 | 11177 | False | 2022-02-25 11:42:08+00:00 | Practical Advice for R in Production - Answer... | [Analytics, DataScience, AI, ML, RStats, Python] | NadarSenpai | False |
| 0 | Saransh Inc | Plainsboro, New Jersey | We are a people-centric company dedicated towa... | 2020-02-28 09:35:19+00:00 | 1968 | 1928 | 3 | False | 2022-02-25 11:41:23+00:00 | Data can be simply defined as 'what you need t... | | Twitter Web App | False |
| 0 | Nathan Joyner | Los Angeles, CA | Global Venture Captial and Private Equity/Busi... | 2015-05-18 20:52:29+00:00 | 60 | 11 | 824 | False | 2022-02-25 11:40:58+00:00 | Daily Confirmed Covid Cases per 1K Population ... | | smcapplication | False |
| 0 | Richard Eudes, PhD | Paris, France | Director @Deloitte. Long-time expert in #DataS... | 2009-06-21 21:04:32+00:00 | 17923 | 1849 | 1472 | False | 2022-02-25 11:40:01+00:00 | Design Patterns in Machine Learning for MLOps ... | [analytics, datascience, bigdata] | Buffer | False |


### 5. Save the Data

#### 5.1 Read past data

##### Skip this step on the very first run, as there is no past data yet. Instead, save your dataframe directly to a CSV file and use this step on later runs.

In [10]:
tweets_old_df = pd.read_csv("../input/data-scraping-data-science-tweets/datascience_tweets.csv")

print(f"past tweets: {tweets_old_df.shape}")

past tweets: (500, 13)


#### 5.2 Merge Past and Present Data

In [11]:
tweets_all_df = pd.concat([tweets_old_df, tweets_df], axis=0)

print(f"new tweets: {tweets_df.shape[0]} past tweets: {tweets_old_df.shape[0]} all tweets: {tweets_all_df.shape[0]}")

new tweets: 500 past tweets: 500 all tweets: 1000


#### 5.3 Drop Duplicates

In [12]:
tweets_all_df.drop_duplicates(subset = ["user_name", "date", "text"], inplace=True)
print(f"all tweets: {tweets_all_df.shape}")

all tweets: (1000, 13)
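
The merge and de-duplication steps can be tried on a toy example. The sketch below uses two tiny made-up frames (the names `old_df`, `new_df` and their contents are illustrative only) with one overlapping row, and applies the same `concat` plus `drop_duplicates(subset=...)` pattern as the cells above.

```python
import pandas as pd

# Made-up "past" and "new" data sharing one identical row (bob's tweet)
old_df = pd.DataFrame({
    "user_name": ["alice", "bob"],
    "date": ["2022-02-24", "2022-02-24"],
    "text": ["first tweet", "second tweet"],
})
new_df = pd.DataFrame({
    "user_name": ["bob", "carol"],
    "date": ["2022-02-24", "2022-02-25"],
    "text": ["second tweet", "third tweet"],
})

# Stack the frames, then keep the first occurrence of each (user, date, text)
all_df = pd.concat([old_df, new_df], axis=0, ignore_index=True)
deduped_df = all_df.drop_duplicates(subset=["user_name", "date", "text"])

print(f"all rows: {len(all_df)} after de-duplication: {len(deduped_df)}")
```

One thing worth checking in practice: a `date` column read back from CSV arrives as strings unless it is parsed, while freshly scraped values are timestamps, so identical tweets may fail to match until both sides share a type.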


#### 5.4 Export the updated data

In [13]:
tweets_all_df.to_csv("datascience_tweets.csv", index=False)

## Method 2 - Tweet Extractor using Twitter username

##### Extract tweets of a particular user using the screen_name

#### 1. Import libraries

In [14]:
import tweepy
import pandas as pd

#### 2. Twitter Authentication

In [15]:
#Pass in the below parameters from your Twitter Developer account
access_key = user_secrets.get_secret("ACCESS_KEY")
access_secret = user_secrets.get_secret("ACCESS_SECRET")
consumer_key = user_secrets.get_secret("CONSUMER_API_KEY")
consumer_secret = user_secrets.get_secret("CONSUMER_API_SECRET")

In [16]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

In [17]:
def get_all_tweets(screen_name):
    alltweets = []
    # The user_timeline endpoint returns at most 200 tweets per request
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    alltweets.extend(new_tweets)

    # Page backwards through the timeline until a request comes back empty
    while len(new_tweets) > 0:
        # max_id is inclusive, so subtract 1 to skip the oldest tweet already saved
        oldest = alltweets[-1].id - 1
        print("getting tweets before %s" % (oldest))
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        print("...%s tweets downloaded so far" % (len(alltweets)))

    # Build and save the dataframe once, after all pages have been downloaded.
    # Text fields stay as str: encoding to bytes would write b'...' literals to the CSV.
    data = [[obj.user.screen_name, obj.user.name, obj.user.id_str, obj.user.description,
             obj.created_at.year, obj.created_at.month, obj.created_at.day,
             "%s.%s" % (obj.created_at.hour, obj.created_at.minute),
             obj.id_str, obj.text] for obj in alltweets]
    dataframe = pd.DataFrame(data, columns=['screen_name', 'name', 'twitter_id', 'description',
                                            'year', 'month', 'date', 'time', 'tweet_id', 'tweet'])
    dataframe.to_csv("%s_tweets.csv" % (screen_name), index=False)
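
The paging logic in `get_all_tweets` can be exercised without network access. The sketch below reproduces the `max_id` walk against an in-memory list. `fetch_all`, `fake_fetch_page` and `TIMELINE` are hypothetical stand-ins for illustration, not Tweepy functions. The fake endpoint is assumed to return items newest first with ids less than or equal to `max_id`, which is why each request subtracts 1 from the oldest id seen.

```python
def fetch_all(fetch_page, page_size=200):
    """Collect every item by paging backwards with max_id (hypothetical helper)."""
    items = []
    page = fetch_page(max_id=None, count=page_size)
    while page:
        items.extend(page)
        # Ask only for items strictly older than the oldest one collected
        oldest = items[-1]["id"] - 1
        page = fetch_page(max_id=oldest, count=page_size)
    return items


# Made-up timeline standing in for the API: ids 1000 down to 1, newest first
TIMELINE = [{"id": i} for i in range(1000, 0, -1)]


def fake_fetch_page(max_id, count):
    """Return up to `count` items with id <= max_id, newest first."""
    eligible = [t for t in TIMELINE if max_id is None or t["id"] <= max_id]
    return eligible[:count]


tweets = fetch_all(fake_fetch_page)
print(f"{len(tweets)} items collected")
```

Because the loop stops on the first empty page, an account with no tweets simply yields an empty list.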

#### Pass in the username of the account you want to download

In [18]:
if __name__ == '__main__':
    get_all_tweets("jack")

getting tweets before 1463382290856243201
...399 tweets downloaded so far
getting tweets before 1445090973772562431
...598 tweets downloaded so far
getting tweets before 1423771357347778561
...798 tweets downloaded so far
getting tweets before 1402719410637389825
...998 tweets downloaded so far
getting tweets before 1360720695337000961
...1197 tweets downloaded so far
getting tweets before 1318724213432254465
...1397 tweets downloaded so far
getting tweets before 1284348041084854271
...1596 tweets downloaded so far
getting tweets before 1267980322492125183
...1795 tweets downloaded so far
getting tweets before 1249880517966589951
...1994 tweets downloaded so far
getting tweets before 1229864795152666625
...2194 tweets downloaded so far
getting tweets before 1204634949296410624
...2393 tweets downloaded so far
getting tweets before 1187904787976773631
...2589 tweets downloaded so far
getting tweets before 1155984590663647232
...2787 tweets downloaded so far
getting tweets before 1131586

## Method 3 - Tweet Extractor using snscrape

##### No Authentication required for using snscrape

### snscrape requires Python 3.8 or higher, but Kaggle notebooks run on Python 3.7. To avoid errors on execution, the snscrape code is commented out.

In [19]:
# import os

### Upgrade the version of snscrape

In [20]:
# pip install --upgrade git+https://github.com/JustAnotherArchivist/snscrape@master

In [21]:
# #Pass in the username whose tweets you want to pull
#os.system("snscrape --jsonl twitter-search 'from:JeffBezos'> user-tweets.json")

In [22]:
#import pandas as pd

# # Reads the json generated from the CLI commands above and creates a pandas dataframe
#tweets_df = pd.read_json('user-tweets.json', lines=True)
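
The `--jsonl` flag makes the CLI write one JSON object per line, which is the format `pd.read_json(..., lines=True)` expects. As a minimal illustration of that format, the sketch below parses JSON Lines with the standard library only. `read_jsonl` is a hypothetical helper and the two sample records are made up, so their fields are not a statement of snscrape's actual output schema.

```python
import io
import json


def read_jsonl(stream):
    """Parse a JSON Lines stream into a list of dicts (hypothetical helper)."""
    records = []
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Made-up sample standing in for the contents of user-tweets.json
sample = io.StringIO(
    '{"id": 1, "content": "first"}\n'
    '{"id": 2, "content": "second"}\n'
)

records = read_jsonl(sample)
print(len(records), "records parsed")
```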

#### Check the shape and Info of the dataframe

In [23]:
#tweets_df.shape

In [24]:
# tweets_df.info()

#### Some more examples

In [25]:
#os.system("snscrape --jsonl --max-results 100 twitter-search 'from:user'> user-tweets.json")

#os.system("snscrape --jsonl --max-results 500 --since 2020-06-01 twitter-search 'its the elephant until:2020-07-31' > text-query-tweets.json")

#### Creating a dataframe from the results of the examples above

In [26]:
# import snscrape.modules.twitter as sntwitter
# import pandas as pd

# # Creating list to append tweet data to
# tweets_list1 = []

# # Using TwitterSearchScraper to scrape data and append tweets to list
# for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:user').get_items()):
#     if i>=100:  # stop after 100 tweets, matching --max-results 100
#         break
#     tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
    
# # Creating a dataframe from the tweets list above 
# tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

In [27]:
# tweets_df1

In [28]:
# import pandas as pd

# # Creating list to append tweet data to
# tweets_list2 = []

# # Using TwitterSearchScraper to scrape data and append tweets to list
# for i,tweet in enumerate(sntwitter.TwitterSearchScraper('its the elephant since:2020-06-01 until:2020-07-31').get_items()):
#     if i>=500:  # stop after 500 tweets, matching --max-results 500
#         break
#     tweets_list2.append([tweet.date, tweet.id, tweet.content, tweet.user.username])
    
# # Creating a dataframe from the tweets list above
# tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

In [29]:
# tweets_df2

# Data Scraping from Reddit

In [30]:
# !pip install praw

Collecting praw
  Downloading praw-7.5.0-py3-none-any.whl (176 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: prawcore, praw
Successfully installed praw-7.5.0 prawcore-2.3.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
import praw
import pandas as pd
import datetime as dt
from tqdm import tqdm
import time

In [5]:
# !pip install user_agent





In [6]:
from user_agent import generate_user_agent, generate_navigator
from pprint import pprint

In [7]:
a_user_agent = generate_user_agent()
pprint(a_user_agent)

import config_reddit

# Script App

CLIENT_ID=config_reddit.CLIENT_ID
CLIENT_SECRET=config_reddit.CLIENT_SECRET
USER_AGENT=a_user_agent # "testscript by u/jonc2000"
USERNAME=config_reddit.USERNAME
PASSWORD=config_reddit.PASSWORD

('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 '
 'Firefox/45.0')
'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'


In [8]:
def get_date(created):
    return dt.datetime.fromtimestamp(created)
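
`get_date` above converts the epoch value using the machine's local timezone, so the same post can show different timestamps on different machines. As a hedged alternative, the sketch below returns a timezone-aware UTC datetime instead. `get_date_utc` is a hypothetical name introduced here for illustration.

```python
import datetime as dt


def get_date_utc(created):
    """Convert epoch seconds to a timezone-aware UTC datetime (hypothetical helper)."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(get_date_utc(1666761000.0))
```

Storing UTC keeps rows from separate runs comparable. Convert to local time only when displaying.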

In [13]:
# Fill in the authentication details below from your Reddit app
def reddit_connection():
    personal_use_script = CLIENT_ID
    client_secret = CLIENT_SECRET
    user_agent = USER_AGENT
    username = USERNAME
    password = PASSWORD

    # Arguments inside parentheses can span lines without backslashes
    reddit = praw.Reddit(client_id=personal_use_script,
                         client_secret=client_secret,
                         user_agent=user_agent,
                         username=username,
                         password=password)
    return reddit
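
The connection above takes its credentials from a local `config_reddit` module. Where such a file is not convenient, one option is to read them from environment variables. The sketch below is an assumption-laden example: `load_reddit_credentials` and the `REDDIT_*` variable names are hypothetical and chosen here for illustration. The returned keys match the keyword arguments passed to `praw.Reddit` in the cell above.

```python
import os

# Hypothetical environment variable names, chosen for this example
REQUIRED_KEYS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_reddit_credentials(env=os.environ):
    """Read Reddit credentials from a mapping, failing early if any are missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "username": env["REDDIT_USERNAME"],
        "password": env["REDDIT_PASSWORD"],
    }


# Demonstration with a plain dict standing in for os.environ
demo_env = {key: "placeholder" for key in REQUIRED_KEYS}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))
```

The dictionary could then be unpacked into the connection call together with a `user_agent` value.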

In [17]:
def build_dataset(reddit, search_words='stablediffusion', items_limit=100):

    # Collect reddit posts
    subreddit = reddit.subreddit(search_words)
    new_subreddit = subreddit.new(limit=items_limit)
    topics_dict = { "title":[],
                "score":[],
                "id":[], "url":[],
                "comms_num": [],
                "created": [],
                "body":[]}

    print("retrieve new reddit posts ...")
    for submission in tqdm(new_subreddit):
        topics_dict["title"].append(submission.title)
        topics_dict["score"].append(submission.score)
        topics_dict["id"].append(submission.id)
        topics_dict["url"].append(submission.url)
        topics_dict["comms_num"].append(submission.num_comments)
        topics_dict["created"].append(submission.created)
        topics_dict["body"].append(submission.selftext)

    # Also collect the subreddit's most recent comments, stored in the same columns as posts
    for comment in tqdm(subreddit.comments(limit=None)):
        topics_dict["title"].append("Comment")
        topics_dict["score"].append(comment.score)
        topics_dict["id"].append(comment.id)
        topics_dict["url"].append("")
        topics_dict["comms_num"].append(0)
        topics_dict["created"].append(comment.created)
        topics_dict["body"].append(comment.body)

    topics_df = pd.DataFrame(topics_dict)
    print(f"new reddit posts retrieved: {len(topics_df)}")
    topics_df['timestamp'] = topics_df['created'].apply(get_date)

    return topics_df

In [24]:
def update_and_save_dataset(topics_df):   
    file_path = "reddit_stablediffusion.csv"
    if os.path.exists(file_path):
        topics_old_df = pd.read_csv(file_path)
        print(f"past reddit posts: {topics_old_df.shape}")
        topics_all_df = pd.concat([topics_old_df, topics_df], axis=0)
        print(f"new reddit posts: {topics_df.shape[0]} past posts: {topics_old_df.shape[0]} all posts: {topics_all_df.shape[0]}")
        topics_new_df = topics_all_df.drop_duplicates(subset = ["id"], keep='last', inplace=False)
        print(f"all reddit posts: {topics_new_df.shape}")
        topics_new_df.to_csv(file_path, index=False)
    else:
        print(f"reddit posts: {topics_df.shape}")
        topics_df.to_csv(file_path, index=False)

In [19]:
# if __name__ == "__main__": 
reddit = reddit_connection()
topics_data_df = build_dataset(reddit)
update_and_save_dataset(topics_data_df)

retrieve new reddit posts ...


100it [00:02, 38.72it/s]
995it [01:36, 10.26it/s]

new reddit posts retrieved: 1095
reddit posts: (1095, 8)





In [21]:
topics_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   title      1095 non-null   object        
 1   score      1095 non-null   int64         
 2   id         1095 non-null   object        
 3   url        1095 non-null   object        
 4   comms_num  1095 non-null   int64         
 5   created    1095 non-null   float64       
 6   body       1095 non-null   object        
 7   timestamp  1095 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 68.6+ KB


In [22]:
topics_data_df.to_csv('reddit_stablediffusion_20221026.csv')
topics_data_df.head()

| | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | Use a custom ckpt file for dreamfusion? | 2 | ydpptd | https://www.reddit.com/r/StableDiffusion/comme... | 0 | 1666761000.0 | I'm looking for a way to turn my art made from... | 2022-10-26 01:15:51 |
| 1 | Cat Escapes Mirror Dimension | 2 | ydphr7 | https://i.redd.it/1hms2z7e13w91.jpg | 1 | 1666761000.0 | | 2022-10-26 01:03:06 |
| 2 | Running a video through img2img frame by frame... | 1 | ydpahq | https://youtu.be/JlO02se5Yw8 | 0 | 1666760000.0 | | 2022-10-26 00:51:52 |
| 3 | Are there AI Model Commissions? | 2 | ydp6r5 | https://www.reddit.com/r/StableDiffusion/comme... | 0 | 1666760000.0 | Bit of a weird question, but do people take co... | 2022-10-26 00:45:59 |
| 4 | Medium Format Film Portraits | 0 | ydp2d3 | https://www.reddit.com/gallery/ydp2d3 | 0 | 1666759000.0 | | 2022-10-26 00:39:11 |


In [20]:
df = pd.read_csv('reddit_stablediffusion.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'reddit_stablediffusion.csv'

In [38]:
df.head()

| | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | [SPOILERS] Never watched GoT, but it seemed si... | 0 | t10uaa | https://i.redd.it/esaepbevhyj81.png | 9 | 1645785000.0 | | 2022-02-25 10:28:09 |
| 1 | [Spoilers] A Feast for Crows in the Series | 0 | t0zgyv | https://www.reddit.com/r/gameofthrones/comment... | 4 | 1645780000.0 | Oh boy, I am listening to this on various audi... | 2022-02-25 09:00:23 |
| 2 | [NO SPOILERS] Where is Braavos? | 0 | t0xra1 | https://www.reddit.com/r/gameofthrones/comment... | 11 | 1645773000.0 | Do y'all think Braavos is representative of a ... | 2022-02-25 07:08:23 |
| 3 | [NO SPOILERS] Game of Thrones display | 9 | t0ufod | https://www.reddit.com/r/gameofthrones/comment... | 2 | 1645762000.0 | I just finished watching all 8 seasons of GoT... | 2022-02-25 04:06:33 |
| 4 | [Spoilers] I'm confused about something in S3 | 2 | t0ts25 | https://www.reddit.com/r/gameofthrones/comment... | 4 | 1645760000.0 | So, when Jon, Ygritte and the other Wildlings ... | 2022-02-25 03:34:08 |


In [39]:
df.shape

(1972, 8)

# Acknowledgements

1. [Reddit Extract content](https://github.com/gabrielpreda/reddit_extract_content/blob/main/reddit_pfizer_vaccine.py)
2. [Tweet Extractor](https://github.com/gabrielpreda/covid-19-tweets/blob/master/covid-19-tweets.ipynb)
3. [Github Snscrape](https://github.com/MartinBeckUT/TwitterScraper/tree/master/snscrape)
4. [Medium article Snscrape](https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721)