<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Scrape_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🐔 Scrape tweets

**Input:** a query (like in Twitter search), a start date, and an end date.

**Output:** a list of tweets with metadata (CSV).

This script scrapes tweets without an API key. It works **day by day**, collecting at most a fixed number of tweets for each day. Because of that cap, it is **not** appropriate for tracking the number of tweets per day. It works well for harvesting a comparable number of tweets per day and following how their profile (mentions, hashtags, etc.) changes over time.
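To make the day-by-day logic concrete, here is a minimal sketch of how a date range can be cut into one-day windows. The helper name `day_windows` is hypothetical and does not appear in the script; the dates are only examples. Note that the end date is exclusive, so a range ending on 2023-03-04 covers March 1 to March 3.

```python
import datetime

def day_windows(since, until):
    """Yield (start, end) date pairs covering [since, until) one day at a time."""
    current = datetime.date.fromisoformat(since)
    last = datetime.date.fromisoformat(until)
    while current < last:
        next_day = current + datetime.timedelta(days=1)
        yield current, next_day
        current = next_day

# Hypothetical example range: three one-day windows
windows = list(day_windows("2023-03-01", "2023-03-04"))
for start, end in windows:
    print(f"since:{start} until:{end}")
```

Each window is scraped separately and capped, which is why the number of tweets collected per day says nothing about the real daily volume.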

## How to use

1. Edit the settings
1. Run all the cells
1. Take the output file from the notebook folder

### How to build a Twitter query?
* You can use a single word or a hashtag
* You can use AND, OR and parentheses
* You can search media-specific fields: check the [official documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query).
* Do not use "since:" and "until:" in the query: the script adds them itself, one day at a time.
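As an illustration of the last point, the sketch below shows how a daily query can be assembled from your own query and one day's bounds. The function `build_daily_query` is a hypothetical helper written for this example, and the check on `since:`/`until:` is an added safeguard that the script itself does not perform.

```python
def build_daily_query(search_query, day_start, day_end):
    """Prefix a user query with the since:/until: operators for one day."""
    lowered = search_query.lower()
    # Assumption: we reject queries that already set their own time bounds
    if "since:" in lowered or "until:" in lowered:
        raise ValueError("Leave since:/until: out of the query; they are added per day.")
    return f"since:{day_start} until:{day_end} {search_query}"

# Hypothetical example query combining two hashtags with OR
example = build_daily_query("#chatgpt OR #gpt4", "2023-03-01", "2023-03-02")
print(example)
```

A query that already contains its own time bounds would conflict with the prefix, which is why such queries are rejected here.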

# SETTINGS

In [None]:
# Settings
search_query = '#chatgpt'
since = '2023-03-01'
until = '2023-03-10'
max_results_per_day = 100

# Output file
output_file = "tweets.csv"
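Before running the script, it can help to sanity-check the settings. The following is only a sketch: `check_settings` is a hypothetical helper, not part of the notebook, and the values passed to it mirror the example settings above.

```python
import datetime

def check_settings(since, until, max_results_per_day):
    """Validate the settings and return the number of days to harvest."""
    start = datetime.date.fromisoformat(since)  # raises ValueError on a malformed date
    end = datetime.date.fromisoformat(until)
    if start >= end:
        raise ValueError("'since' must be earlier than 'until'")
    if max_results_per_day < 1:
        raise ValueError("'max_results_per_day' must be at least 1")
    return (end - start).days

n_days = check_settings("2023-03-01", "2023-03-10", 100)
print(f"{n_days} days, at most {n_days * 100} tweets")
```

Because the end date is exclusive, the number of days is simply the difference between the two dates, and multiplying it by the daily cap gives an upper bound on the size of the output.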

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries.
You can ignore the output of this cell.

In [None]:
# Install and import necessary libraries
!pip install snscrape # SNScrape - the Twitter scraping library
import pandas as pd # Pandas: deal with the data
import snscrape.modules.twitter as sntwitter # SNScrape: get Twitter data
import datetime

### Metadata of the tweets that we keep

In [None]:
# Tweet features to keep in the output
features = [
  "url",
  "date",
  "id",
  "user",
  #"card",
  #"cashtags",
  "coordinates",
  "hashtags",
  "inReplyToTweetId",
  "inReplyToUser",
  "media",
  "mentionedUsers",
  "place",
  "quotedTweet",
  "retweetedTweet",
  "sourceLabel",
  #"sourceUrl",
  #"renderedContent",
  "replyCount",
  "retweetCount",
  "likeCount",
  "quoteCount",
  "conversationId",
  "lang",
  #"source"
]

### Scrape the tweets

In [None]:
tweets = []
currentDate = datetime.date.fromisoformat(since)
maxDate = datetime.date.fromisoformat(until)
while currentDate<maxDate:
  nextDate = currentDate + datetime.timedelta(days=1)
  print("# Harvesting new day "+str(currentDate))
  # Harvest the tweets
  query = "since:"+str(currentDate)+" until:"+str(nextDate)+" "+search_query
  print("QUERY: "+query)
  scraper = sntwitter.TwitterSearchScraper(query)
  for i, tweet in enumerate(scraper.get_items()):
    data = []
    for f in features:
      if f == "card":
        if tweet.card:
          data.append(tweet.card.title)
        else:
          data.append("")
      elif f == "cashtags":
        data.append(";".join(tweet.cashtags))
      elif f == "coordinates":
        if tweet.coordinates:
          data.append("{};{}".format(tweet.coordinates.latitude, tweet.coordinates.longitude))
        else:
          data.append("")
      elif f == "hashtags":
        if tweet.hashtags:
          data.append(";".join(tweet.hashtags))
        else:
          data.append("")
      elif f == "inReplyToTweetId":
        if tweet.inReplyToTweetId:
          data.append(tweet.inReplyToTweetId)
        else:
          data.append("")
      elif f == "inReplyToUser":
        if tweet.inReplyToUser:
          data.append(str(tweet.inReplyToUser).replace("https://twitter.com/", ""))
        else:
          data.append("")
      elif f == "media":
        if tweet.media:
          medias = []
          for m in tweet.media:
            if hasattr(m,"fullUrl"):
              medias.append(m.fullUrl)
            elif hasattr(m,"thumbnailUrl"):
              medias.append(m.thumbnailUrl)
            elif hasattr(m,"url"):
              medias.append(m.url)
            else:
              medias.append("MISSING")
          data.append(";".join(medias))
        else:
          data.append("")
      elif f == "mentionedUsers":
        if tweet.mentionedUsers:
          data.append(";".join([u.username for u in tweet.mentionedUsers]))
        else:
          data.append("")
      elif f == "place":
        if tweet.place:
          data.append(tweet.place.fullName)
        else:
          data.append("")
      elif f == "quotedTweet":
        if tweet.quotedTweet:
          data.append(tweet.quotedTweet)
        else:
          data.append("")
      elif f == "retweetedTweet":
        if tweet.retweetedTweet:
          data.append(tweet.retweetedTweet)
        else:
          data.append("")
      elif f == "sourceLabel":
        if tweet.sourceLabel:
          data.append(tweet.sourceLabel)
        else:
          data.append("")
      elif f == "sourceUrl":
        if tweet.sourceUrl:
          data.append(tweet.sourceUrl)
        else:
          data.append("")
      elif f == "url":
        if tweet.url:
          data.append(tweet.url)
        else:
          data.append("")
      elif f == "date":
        if tweet.date:
          data.append(tweet.date)
        else:
          data.append("")
      elif f == "renderedContent":
        if tweet.renderedContent:
          data.append(tweet.renderedContent)
        else:
          data.append("")
      elif f == "id":
        if tweet.id:
          data.append(tweet.id)
        else:
          data.append("")
      elif f == "user":
        if tweet.user:
          data.append(tweet.user.username.replace("https://twitter.com/", ""))
        else:
          data.append("")
      elif f == "replyCount":
        if tweet.replyCount:
          data.append(str(tweet.replyCount))
        else:
          data.append("0")
      elif f == "retweetCount":
        if tweet.retweetCount:
          data.append(str(tweet.retweetCount))
        else:
          data.append("0")
      elif f == "likeCount":
        if tweet.likeCount:
          data.append(str(tweet.likeCount))
        else:
          data.append("0")
      elif f == "quoteCount":
        if tweet.quoteCount:
          data.append(str(tweet.quoteCount))
        else:
          data.append("0")
      elif f == "conversationId":
        if tweet.conversationId:
          data.append(tweet.conversationId)
        else:
          data.append("")
      elif f == "lang":
        if tweet.lang:
          data.append(tweet.lang)
        else:
          data.append("")
      elif f == "source":
        if tweet.source:
          data.append(tweet.source)
        else:
          data.append("")

    tweets.append(data)

    if i>0 and i%1000 == 0:
      print("  "+str(i)+" tweets harvested for that day...")

    # Stop once the daily cap is reached (i is zero-based, so i+1 tweets are stored)
    if i + 1 >= max_results_per_day:
      break
      
  currentDate = nextDate
  print("")

print("Done.")

### Save as a CSV

In [None]:
# Convert the list of tweets to a DataFrame and save it to a CSV file
tweetdf = pd.DataFrame(tweets, columns=features)
try:
  tweetdf.to_csv(output_file, index = False, encoding='utf-8', sep=',')
  print("Done.")
except IOError:
  # The backslash is doubled so that it is not read as an escape sequence
  print("/!\\ Error while writing the output file")
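Once the file is written, multi-valued columns such as `hashtags` and `mentionedUsers` hold values joined with `;`. As a rough sketch of how they can be read back, the example below counts hashtags using only the standard library. The sample rows are made up for illustration; in practice you would open the output file instead of the in-memory string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using two of the output columns
sample_csv = io.StringIO(
    "date,hashtags\n"
    "2023-03-01 10:00:00+00:00,ChatGPT;AI\n"
    "2023-03-01 11:00:00+00:00,chatgpt\n"
    "2023-03-02 09:00:00+00:00,\n"
)

counts = Counter()
for row in csv.DictReader(sample_csv):
    # Split the ';'-joined cell, skip empty cells, and ignore case
    counts.update(tag.lower() for tag in row["hashtags"].split(";") if tag)

print(counts.most_common())
```

Lowercasing the tags is a design choice made here so that `#ChatGPT` and `#chatgpt` are counted together; drop it if case matters for your analysis.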