# Scrape Twitter Data for Semantic Analysis

In this notebook, I scrape a single user's tweets (including retweets) up to the Nth scroll refresh.
For this tutorial, let's scrape CNN.

#### Required Modules:
Scraping Twitter can be done in a number of ways. The quick method I've used here relies on Beautiful Soup to scrape particular parts of the HTML, as it would for any webpage. The complication is that Twitter limits the initially loaded page to 20 tweets (manually scrolling through more and then running the scraper doesn't seem to add more data, nor is it desirable). So we also have to install a package called Selenium, which drives a browser and scrolls the page automatically to pull up more tweets. I've read that pinging Twitter too often can get you blocked, so I pause 2 s after each page scroll. I'm not sure how important that is.

The modules imported below are selenium, bs4, time (part of the standard library), pymongo, numpy, and pandas. Note that pymongo and numpy are imported but not used in the cells shown here.

There are other ways to do this (including ones that still use Beautiful Soup), such as finding the HTML page for the time period of tweets you want and scraping just that.

##### Input Variables:
1) Twitter handle (appended to the base URL)
2) Number of page scrolls. Right now, the scroll loop in the code simply runs until the page height stops changing, but it can easily be adjusted to a fixed count, a % of total tweets, etc.



In [6]:
#### Load Modules
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pymongo
import numpy as np
import pandas as pd

In [5]:
username='CNN'
url = "https://twitter.com/" + username

In [7]:
# This next step opens a Chrome window and sets it up for automatic scrolling.
# Open the webdriver and load the url
driver = webdriver.Chrome()
# driver = webdriver.Firefox()  # alternative browser
driver.get(url)
driver.maximize_window()

In [8]:
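# Scroll to the bottom of the page repeatedly, pausing 2 s after each scroll,
# until the page height stops changing (i.e., no more tweets are loaded).

# lastHeight = driver.execute_script("return document.body.scrollHeight")
# lastHeight = lastHeight/5
lastHeight = 100  # arbitrary starting value; overwritten after the first scroll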
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
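
The loop above keeps scrolling until the page stops growing. If you would rather cap the number of page scrolls (the second input variable), a minimal sketch could look like the following. The function name `scroll_page` and the parameters `max_scrolls` and `pause` are my own illustrative choices, not part of Selenium; the only Selenium call assumed is `driver.execute_script`, which the loop above already uses.

```python
import time


def scroll_page(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom at most max_scrolls times.

    Stops early if the page height stops changing.
    Returns the number of scroll attempts made.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # be polite: wait between scrolls
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls
```

With the driver opened earlier, `scroll_page(driver, max_scrolls=5)` would scroll at most five times, keeping the same 2 s pause by default.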

In [9]:
# Now that we have a full scroll, use Beautiful Soup to extract the page contents.
# This will collect a ton of extraneous info, so the next step is to grab only what we want.
# The upside is that there is a lot of metadata to extract if you want it.
pho = BeautifulSoup(driver.page_source, "html.parser")

In [10]:
# For our purposes, extract the tweet's origin (the user we are looking at, or the
# account being retweeted), the tweet text, the unique tweet ID, and the timestamp.
# All info is saved in tweetFrame.

tweets = pho.find_all('li', 'js-stream-item')
tweetslist = []
for tweet in tweets:
    text_tag = tweet.find('p', 'tweet-text')
    if text_tag is None:
        continue  # skip stream items that do not contain tweet text
    tweet_user = tweet.find('span', 'username').text
    # In Python 3, .text is already a str; calling .encode('utf8') would store bytes instead
    tweet_text = text_tag.text
    tweet_id = tweet['data-item-id']
    timestamp = tweet.find('a', 'tweet-timestamp')['title']
    tweetslist.append([tweet_user, tweet_text, tweet_id, timestamp])
tweetFrame = pd.DataFrame(tweetslist, columns=["tweet_user", "tweet_text", "tweet_id", "timestamp"])
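
The timestamp is stored as a display string such as `1:54 PM - 18 Aug 2017`. If you want to sort or filter by time later, here is a small sketch that converts it to a `datetime` using only the standard library. The helper name `parse_timestamp` and the format string are my assumptions based on the sample output; `%b` and `%p` depend on the locale, so this assumes English month and AM/PM names.

```python
from datetime import datetime

# Assumed format, inferred from strings like "1:54 PM - 18 Aug 2017"
TIMESTAMP_FORMAT = "%I:%M %p - %d %b %Y"


def parse_timestamp(raw):
    """Convert a scraped timestamp string into a datetime object."""
    return datetime.strptime(raw.strip(), TIMESTAMP_FORMAT)


print(parse_timestamp("1:54 PM - 18 Aug 2017"))
```

The result has no time zone attached, since the scraped string does not carry one.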

In [11]:
#Let's take a look!
tweetFrame

|   | tweet_user | tweet_text | tweet_id | timestamp |
|---|---|---|---|---|
| 0 | @CNN | Some people have taken genetic tests to see ho... | 898649381452304386 | 1:54 PM - 18 Aug 2017 |
| 1 | @CNN | New bill seeks to curb tech's sexual harassmen... | 898646989570588672 | 1:45 PM - 18 Aug 2017 |
| 2 | @CNN | One month: @BrookeBCNN lists President Trump's... | 898645513817305088 | 1:39 PM - 18 Aug 2017 |
| 3 | @CNN | We’re in Times Square on World Humanitarian Da... | 898644995183263744 | 1:37 PM - 18 Aug 2017 |
| 4 | @CNN | Who has left the Trump administration? Here's ... | 898644317505564674 | 1:34 PM - 18 Aug 2017 |
| 5 | @CNN | President Trump has had a chaotic past four we... | 898639722037235713 | 1:16 PM - 18 Aug 2017 |
| 6 | @CNN | Donald Trump fired Steve Bannon. But Bannon st... | 898637528298774528 | 1:07 PM - 18 Aug 2017 |
| 7 | @CNN | This fall, Colorado will deploy a fully autono... | 898632986157101060 | 12:49 PM - 18 Aug 2017 |
| 8 | @CNN | Remaining 16 members of the President's Commit... | 898624809017507841 | 12:17 PM - 18 Aug 2017 |
| 9 | @CNN | '#WAR': Breitbart set to take on Trump White H... | 898620454881169409 | 11:59 AM - 18 Aug 2017 |


In [None]:
# From here, you can clean the tweet data (e.g., strip punctuation), compare the number of tweets vs. retweets, etc.
# Or just save it and do something later.
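
As a rough illustration of those two clean-up ideas, here is a sketch that works on plain rows shaped like `tweetslist` (`[tweet_user, tweet_text, tweet_id, timestamp]`). The helper names and the sample rows are hypothetical. I'm also assuming that a row whose `tweet_user` differs from the scraped handle is a retweet, which follows from how the user column is collected above but is not verified against Twitter's markup.

```python
import string

# Translation table that deletes ASCII punctuation only; curly quotes such as ’ are kept.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation (this also drops the @ and # of mentions and hashtags)."""
    return text.translate(_PUNCT_TABLE)


def count_tweets_and_retweets(rows, handle):
    """Return (own_tweets, retweets), treating rows from other users as retweets."""
    own = sum(1 for row in rows if row[0].lstrip("@").lower() == handle.lower())
    return own, len(rows) - own


# Hypothetical sample rows, for illustration only
sample_rows = [
    ["@CNN", "Breaking: something happened!", "1", "1:54 PM - 18 Aug 2017"],
    ["@SomeoneElse", "A retweeted post.", "2", "1:45 PM - 18 Aug 2017"],
]
print(strip_punctuation(sample_rows[0][1]))
print(count_tweets_and_retweets(sample_rows, "CNN"))
```

Working on plain lists keeps the sketch free of extra dependencies; the same functions can be applied to `tweetFrame` with `tweetFrame["tweet_text"].map(strip_punctuation)`.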

In [17]:
#Save the data
saveName=username+'.csv'
tweetFrame.to_csv(saveName)