# Making sense of snscrape

Author: Sid

In [2]:
%load_ext autoreload
%autoreload 2

Make sure snscrape is installed; you can do this by running `pip install snscrape` in your terminal. Then run the cell below to import the required libraries.

In [3]:
import pandas as pd
from basescrape import *
import snscrape.modules.twitter as sntwitter

## Why snscrape?
In recent years, the Twitter API has gotten a lot worse. Most features have been paywalled, and the ones that are still available are heavily rate limited so that the servers don't get overloaded. While this is good for stability in theory, it isn't ideal for people who want to scrape tweets from Twitter and analyse data about them. Using snscrape is a good way to harvest data without using the API. The only issue is that it's poorly documented when it comes to its Python modules and object names. This notebook explores some of the different scrapers in the library.

## Looking at different modules

Some of the scraper classes in `snscrape.modules.twitter` include:
* TwitterProfileScraper
* TwitterTweetScraper
* TwitterUserScraper
* TwitterTrendsScraper

For the purposes of the project we probably want to look at `TwitterUserScraper`, as it doesn't include tweets from other people mentioning the user we are searching for.

You can look at everything it returns for the most recent tweet on a profile, and from that we can grab all of the useful data we need and manipulate it however we want.

The cell below grabs the data of the latest tweet from `NAME`. Try it if you want to see all the info we can use.

In [4]:
NAME = "30plem" #paste the twitter handle here without the '@'
user_scraper = sntwitter.TwitterUserScraper(NAME).get_items()
print(type(user_scraper))
for tweet in user_scraper:
    break
tweet

<class 'generator'>


Tweet(url='https://twitter.com/30plem/status/1632938845216858112', date=datetime.datetime(2023, 3, 7, 2, 58, 53, tzinfo=datetime.timezone.utc), rawContent='@Grant2Will these replies 😭', renderedContent='@Grant2Will these replies 😭', id=1632938845216858112, user=User(username='30plem', id=3380634093, displayname='ok', rawDescription='yea', renderedDescription='yea', descriptionLinks=None, verified=False, created=datetime.datetime(2015, 7, 17, 18, 29, 11, tzinfo=datetime.timezone.utc), followersCount=89, friendsCount=129, statusesCount=6177, favouritesCount=3871, listedCount=1, mediaCount=446, location='', protected=False, link=None, profileImageUrl='https://pbs.twimg.com/profile_images/1544587916462043137/NNzDmMXu_normal.jpg', profileBannerUrl='https://pbs.twimg.com/profile_banners/3380634093/1657093558', label=None), replyCount=0, retweetCount=0, likeCount=0, quoteCount=0, conversationId=1631722353926316046, lang='en', source='<a href="https://mobile.twitter.com" rel="nofollow">Twitter

As you can see, the module pulled a bunch of fields from the latest tweet. For example, `date` gives the date and time the tweet was posted, while `retweetCount` and `likeCount` give the number of retweets and likes the tweet has received. We can read these values individually by writing the loop variable, in this case `tweet`, followed by a dot and the field that we want to look at.

Fields that describe the account sit inside the nested `user` object: `tweet.user.displayname` returns the display name of the Twitter user, while `tweet.user.followersCount` tells you how many followers the user has.
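To make the dotted access concrete without hitting the network, here is a minimal sketch that uses a stand-in object instead of a real scraped tweet. The `SimpleNamespace` objects are an assumption made for illustration; only the field names and values are taken from the output above.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Stand-in for a scraped tweet, filled with values from the output above.
# A real snscrape Tweet carries many more fields.
tweet = SimpleNamespace(
    date=datetime(2023, 3, 7, 2, 58, 53, tzinfo=timezone.utc),
    rawContent="@Grant2Will these replies 😭",
    likeCount=0,
    retweetCount=0,
    user=SimpleNamespace(username="30plem", displayname="ok", followersCount=89),
)

summary = {
    # Tweet-level fields hang directly off the tweet object.
    "date": tweet.date.isoformat(),
    "likes": tweet.likeCount,
    "retweets": tweet.retweetCount,
    # Account-level fields are reached through the nested user object.
    "displayname": tweet.user.displayname,
    "followers": tweet.user.followersCount,
}
print(summary)
```

The same dotted lookups should work on the `tweet` variable from the cell above, since the printed output shows the same field names.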

## Pulling specific data from a tweet

In `basescrape.py`, there is a function called `get_tweet` which returns the `rawContent` of a tweet. The older `content` attribute is deprecated, so using it raises deprecation warnings.

Feel free to test grabbing the content of a tweet from any account using the code cell below.

If the tweet contains only an image, the output will be a shortened hyperlink to the tweet.

In [16]:
NAME = "jack"
tweet = get_tweet(NAME)
print(tweet)

and Google Play Store: https://t.co/1Ve7GIBG0F

and of course, the open web: https://t.co/qXl9xrmtKE (one of many)
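The source of `basescrape.py` is not shown in this notebook, so the following is only a guess at how a helper like `get_tweet` could work, not its actual implementation. The name `latest_tweet_text` is hypothetical, and the sketch assumes the scraper yields the newest tweet first, as the earlier cell suggests. It takes any iterable of tweet-like objects so it can be tried without network access.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def latest_tweet_text(items: Iterable) -> Optional[str]:
    """Return rawContent of the first item, or None if there are no items."""
    first = next(iter(items), None)
    return None if first is None else first.rawContent


# With snscrape installed, `items` could be
# sntwitter.TwitterUserScraper("jack").get_items(); fake tweets are used here.
fake_items = (SimpleNamespace(rawContent=text) for text in ["newest", "older"])
print(latest_tweet_text(fake_items))
print(latest_tweet_text([]))
```

Using `next` with a default avoids the `for ... break` pattern and also covers a profile with no tweets, where the loop variable would otherwise never be assigned.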


You can read several of these fields at once and collect them in a list to see the key statistics for a tweet.

The `get_tweet_data` function does this for the most recent tweet on a profile, returning its date, its content, and a numeric count.

In [9]:
NAME = "jack"
tweet_data = get_tweet_data(NAME)
print(tweet_data)

[[datetime.datetime(2023, 1, 31, 22, 38, 10, tzinfo=datetime.timezone.utc), 'and Google Play Store: https://t.co/1Ve7GIBG0F\n\nand of course, the open web: https://t.co/qXl9xrmtKE (one of many)', 3273]]


As you can see, the date looks odd inside the list. This is because the date is stored as a `datetime` object, and printing a list shows the `repr` of each element; printing the date on its own gives a readable timestamp. You can access individual components of the object if you want to compare them to an existing date and time. This is one approach you can take to compare tweet times to the times of stock price drops, if you wanted to. Use [this](https://www.listendata.com/2019/07/how-to-use-datetime-in-python.html#id-886a0d) link to learn more about datetime objects.
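As a small sketch of what comparing against an existing date and time can look like, the snippet below uses the tweet timestamp from the output above. The `event_time` value is hypothetical and only stands in for something like a recorded price drop.

```python
from datetime import datetime, timedelta, timezone

# Timestamp of the scraped tweet shown above (timezone-aware, UTC).
tweet_time = datetime(2023, 1, 31, 22, 38, 10, tzinfo=timezone.utc)

# Hypothetical event time, e.g. the moment a price drop was recorded.
event_time = datetime(2023, 2, 1, 0, 0, 0, tzinfo=timezone.utc)

# Individual components are available as attributes.
print(tweet_time.year, tweet_time.month, tweet_time.hour)

# Timezone-aware datetimes can be compared and subtracted directly.
gap = event_time - tweet_time
print(tweet_time < event_time)
print(gap <= timedelta(hours=2))
```

Both values need to be timezone-aware (or both naive); comparing an aware datetime with a naive one raises a `TypeError`.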

For the purposes of this project, it is best to convert the datetime object into a string, which can then be broken down into separate date and time parts. These can then be compared to other date and time strings, which might be useful later. The `datetime` class has a method that can do this, `strftime`, and below you can see it in use.
Run the code cell below to import it; it can then be tested with a datetime object from a scraped tweet.

(For simplicity, a `get_tweet_date` function was defined that gets the date of the latest tweet from a profile.)

In [20]:
from datetime import datetime
NAME = "jack"
dates = get_tweet_date(NAME)
print(dates)
date = dates.strftime("%d%m%Y")
time = dates.strftime("%H%M%S")
date_type=type(date)
time_type=type(time)
print(f"The date of this tweet is {date} and it is a {date_type}.")
print(f"The time of this tweet is {time} and it is a {time_type}.")


2023-01-31 22:38:10+00:00
The date of this tweet is 31012023 and it is a <class 'str'>.
The time of this tweet is 223810 and it is a <class 'str'>.
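One thing to keep in mind when comparing these strings: the sketch below, which uses a fixed timestamp rather than a scraped one, shows that day-first strings such as `%d%m%Y` can be sliced into parts easily but do not sort in chronological order, whereas year-first strings do. The alternative format `%Y%m%d` is a suggestion, not something the project code is known to use.

```python
from datetime import datetime, timezone

stamp = datetime(2023, 1, 31, 22, 38, 10, tzinfo=timezone.utc)

date_str = stamp.strftime("%d%m%Y")
time_str = stamp.strftime("%H%M%S")

# Pull the pieces back out of the date string by position.
day, month, year = date_str[:2], date_str[2:4], date_str[4:]
print(day, month, year)

# Day-first strings do not sort chronologically; year-first ones do.
earlier = datetime(2023, 1, 31)
later = datetime(2023, 2, 1)
day_first_ok = earlier.strftime("%d%m%Y") < later.strftime("%d%m%Y")
year_first_ok = earlier.strftime("%Y%m%d") < later.strftime("%Y%m%d")
print(day_first_ok, year_first_ok)

# strptime turns the strings back into a (naive) datetime when a real
# chronological comparison is needed.
parsed = datetime.strptime(date_str + time_str, "%d%m%Y%H%M%S")
print(parsed)
```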


## Scraping tweets within a certain timeframe
Building on the previous functions, the elements can be combined to scrape a profile for a fixed number of tweets. This can later be customized so that the number of tweets scraped can be chosen, or so that only the tweets posted in a certain month are kept. The function `get_thou_tweets` takes the Twitter handle as input and stores the last 10000 tweets in a csv called `"NAME.csv"` in the `raw-data` folder, where `NAME` is the display name of the Twitter profile. The data is stored as a pandas dataframe, with the columns holding the date, time, content, and view count of each tweet respectively. The dataframe is returned for further analysis and plotting.
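The body of `get_thou_tweets` is not shown here, so the following is only a sketch of the general idea: cap the number of tweets read from the generator and turn each one into a row. The function name `tweets_to_rows`, the dictionary keys, and the use of a `viewCount` field are assumptions for illustration, and fake tweets stand in for scraper output.

```python
from datetime import datetime, timedelta, timezone
from itertools import islice
from types import SimpleNamespace


def tweets_to_rows(items, limit):
    """Turn at most `limit` tweets into dicts of date, time, content and views."""
    rows = []
    for tweet in islice(items, limit):  # islice stops the generator early
        rows.append({
            "date": tweet.date.strftime("%d%m%Y"),
            "time": tweet.date.strftime("%H%M%S"),
            "content": tweet.rawContent,
            "views": tweet.viewCount,  # assumed field name
        })
    return rows


# Fake tweets, newest first, one hour apart.
start = datetime(2023, 1, 31, 22, 38, 10, tzinfo=timezone.utc)
fake_items = (
    SimpleNamespace(date=start - timedelta(hours=i), rawContent=f"tweet {i}", viewCount=100 * i)
    for i in range(50)
)

rows = tweets_to_rows(fake_items, limit=3)
print(len(rows), rows[0])
```

Passing `rows` to `pd.DataFrame(rows)` would give one column per key, and `DataFrame.to_csv` could then write the result into the `raw-data` folder.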

In [7]:
NAME = "elonmusk"

tweets_df = get_thou_tweets(NAME)

You may have noticed that this took a very long time. If we are planning to scrape tweets across longer time periods, the number of tweets being stored has to be reduced to save time and space. One way to do this is to append only the tweets that meet certain conditions.

One of these conditions is the month in which the tweet was posted. The function `get_tweets_in_month` scans the same 10000-tweet range, but this time only tweets within a specified month are saved to the csv. This data can be viewed in the `month-data` folder.

Note: This function differs from the previous ones in a few ways. The `statusesCount` (total tweets) value needs to be read from the `user` object nested inside each tweet, so that the loop can be exited if the loop index surpasses it. The position of the loop conditions changes, and the additional attribute lookup takes place at the start of the for loop.
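A rough sketch of the loop described in the note might look like the following. It is not the code from `basescrape.py`: the name `rows_for_month` is hypothetical, the month is passed as an integer here (the format the real function expects is not documented above), and fake tweets replace the scraper.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def rows_for_month(items, month, scan_limit=10000):
    """Scan up to `scan_limit` tweets, keeping those posted in `month` (1-12)."""
    rows = []
    for index, tweet in enumerate(items):
        # The extra lookup comes first: the total lives on the nested user object.
        total_tweets = tweet.user.statusesCount
        if index >= scan_limit or index >= total_tweets:
            break
        if tweet.date.month == month:
            rows.append((tweet.date.strftime("%d%m%Y"), tweet.rawContent))
    return rows


def fake_tweet(year, month, day, text, user):
    """Build a stand-in tweet with only the fields the loop reads."""
    stamp = datetime(year, month, day, 9, 0, tzinfo=timezone.utc)
    return SimpleNamespace(date=stamp, rawContent=text, user=user)


# Fake account with five tweets, newest first.
user = SimpleNamespace(statusesCount=5)
fake_items = [
    fake_tweet(2023, 2, 3, "feb a", user),
    fake_tweet(2023, 2, 1, "feb b", user),
    fake_tweet(2023, 1, 31, "jan a", user),
    fake_tweet(2023, 1, 15, "jan b", user),
    fake_tweet(2022, 12, 30, "dec a", user),
]

january = rows_for_month(fake_items, month=1)
print(january)
```

Note that this sketch matches on the month number alone, so tweets from the same month in different years would all be kept unless the year is checked too.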

In [4]:
NAME = "elonmusk"

tweets_df = get_tweets_in_month(NAME, "")

This still takes a long time, so when working with this data it may be best to start from a pre-existing csv and write the edited data to a separate file that has everything we are looking for. The dataframe can then be scanned row by row, keeping only specific dates and/or times. This drastically decreases the computational workload, and allows us to map each tweet onto any graph we want.
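As an illustration of working from an existing csv instead of scraping again, the sketch below filters a small in-memory table with pandas. The column names, the sample rows, and the chosen date and time cut-off are all assumptions; an in-memory buffer is used in place of real files so the snippet runs anywhere.

```python
import io

import pandas as pd

# Stand-in for a previously saved csv; the column names are assumptions.
raw_csv = io.StringIO(
    "date,time,content,views\n"
    "31012023,223810,first tweet,3273\n"
    "31012023,091500,second tweet,120\n"
    "15022023,140000,third tweet,45\n"
)

# Read date and time as strings so leading zeros survive.
df = pd.read_csv(raw_csv, dtype={"date": str, "time": str})

# Keep only rows from one date, posted at or after 12:00:00.
# Fixed-width HHMMSS strings compare in chronological order.
wanted = df[(df["date"] == "31012023") & (df["time"] >= "120000")]

# Write the reduced table somewhere other than the original file.
out = io.StringIO()  # a real path would go here instead
wanted.to_csv(out, index=False)
print(wanted)
```

Boolean masks like this filter the whole column at once, which tends to be much faster than looping over rows in Python.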

## Excluding dates and times
In some cases, you may want to exclude all tweets tweeted past a certain time, as they are not relevant. To make the target datafile as manageable as possible, a number of truncations and assumptions need to be made about the tweets being scraped.

### Truncating tweets after-hours

Not every tweet is seen by everyone, and some may have no link to the adjacent charts that we are interested in. It can therefore be useful to store only the tweets posted within certain time windows.
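One possible way to express such a window is sketched below. The 14:30 to 21:00 UTC bounds are placeholders rather than values used by the project, and `in_window` is a hypothetical helper name.

```python
from datetime import datetime, time, timezone

# Placeholder window bounds, expressed in UTC.
WINDOW_START = time(14, 30)
WINDOW_END = time(21, 0)


def in_window(stamp, start=WINDOW_START, end=WINDOW_END):
    """True if the UTC time of day of `stamp` falls within [start, end)."""
    return start <= stamp.astimezone(timezone.utc).time() < end


stamps = [
    datetime(2023, 1, 31, 22, 38, 10, tzinfo=timezone.utc),  # after the window
    datetime(2023, 1, 31, 15, 0, 0, tzinfo=timezone.utc),    # inside the window
    datetime(2023, 1, 31, 9, 15, 0, tzinfo=timezone.utc),    # before the window
]

kept = [stamp for stamp in stamps if in_window(stamp)]
print(kept)
```

Converting to UTC before taking the time of day keeps the check consistent even if timestamps arrive in different timezones.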