# Background
The goal of the project is to analyze the tweeting behavior of the 100 US senators. Each senator has a Twitter account and posts tweets that reflect their thoughts and communicate with the people of the state they represent. The questions guide you through getting the data, processing and cleaning it, putting it in a format that is easier to analyze, and then doing some basic analysis. The last few questions ask whether a senator's mentions of a president or presidential candidate depend on the senator's party. For example, do Democratic senators mention Barack Obama in their tweets more or less often than Republican senators?

For background information only: The file project/fetch_senator_tweets.py downloads tweets using the Python twitter package to interact with Twitter's API. You will only be able to run that code if you set up your own Twitter account and follow the instructions at the start of the file regarding filling in the authentication information (CONSUMER_KEY, CONSUMER_SECRET, etc.).

I've already run the code mentioned above and downloaded the data for you. The downloaded information on the senators' Twitter accounts is in project/senators-list.json in the GitHub repository, while the downloaded tweets are in timelines.json. timelines.json is too big to put in the GitHub repository. You can find it at http://www.stat.berkeley.edu/~paciorek/transfer/timelines.json. Note that there are only 200 tweets for each senator because of limits on how many tweets can be accessed in a given request.



### 1. Load the senators-list.json and timelines.json files into Python as objects called senators and timelines.

In [2]:
import json
from urllib.request import urlopen

# Load senators-list.json
with open('project/senators-list.json', 'r') as senators_file:
    senators = json.load(senators_file)

# Load timelines.json
timelines_url = "http://www.stat.berkeley.edu/~paciorek/transfer/timelines.json"
timelines_response = urlopen(timelines_url)
timelines = json.loads(timelines_response.read())


### 2. What type of data structure is timelines? How many timelines are there? What does each timeline correspond to?

In [3]:
# Structure of timelines
print(type(timelines), type(timelines[0]), type(timelines[0][0]))

# timelines is a nested list with two levels:
# the outer list has one entry per senator (100 in total),
# and each inner list holds that senator's 200 tweets, each stored as a dict.
print(len(timelines), len(timelines[0]))

<class 'list'> <class 'list'> <class 'dict'>
100 200
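
To make the answer concrete, the structure can be checked on a small stand-in. The sketch below uses a made-up `toy_timelines` list (not the real data) and assumes, as in the real file, that every tweet dict carries a `user` dict with a `screen_name`. It confirms that each inner list belongs to exactly one account.

```python
# Hypothetical stand-in for timelines: a list (senators) of lists (tweets) of dicts.
toy_timelines = [
    [{"text": "first", "user": {"screen_name": "SenA"}},
     {"text": "second", "user": {"screen_name": "SenA"}}],
    [{"text": "third", "user": {"screen_name": "SenB"}}],
]

def accounts_per_timeline(timelines):
    """Return, for each timeline, the set of screen names that appear in it."""
    return [{tweet["user"]["screen_name"] for tweet in timeline}
            for timeline in timelines]

owners = accounts_per_timeline(toy_timelines)
print(owners)
```

If every set has exactly one element, each timeline corresponds to a single account.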


### 3. Make a list of the number of followers each senator has.

In [4]:
# Collect the keys that appear in the tweet dicts and in their nested "user" dicts.
keys = set()
user_keys = set()

for senator in timelines:
    for tweet in senator:
        keys.update(tweet.keys())
        user_keys.update(tweet["user"].keys())

In [5]:
follower_count_list = []

for senator in timelines:

    # Use sets to check that each value is the same across a senator's tweets
    names, followers, screen_names = set(), set(), set()
    for tweet in senator:
        user = tweet["user"]
        names.add(user["name"])
        followers.add(user["followers_count"])
        screen_names.add(user["screen_name"])

    # Keep the single value if it is unique; otherwise record a warning string
    final_name = names.pop() if len(names) == 1 else "multiple names detected"
    final_followers = followers.pop() if len(followers) == 1 else "multiple followers count detected"
    final_screen_name = screen_names.pop() if len(screen_names) == 1 else "multiple screen names detected"

    follower_count_list.append({'name': final_name, 'screen_name': final_screen_name, 'follower_count': final_followers})

follower_count_list

[{'name': 'Senator Thom Tillis',
  'screen_name': 'SenThomTillis',
  'follower_count': 12100},
 {'name': 'Senator Ben Sasse',
  'screen_name': 'SenSasse',
  'follower_count': 29204},
 {'name': 'Senator Mike Rounds',
  'screen_name': 'SenatorRounds',
  'follower_count': 7220},
 {'name': 'SenDanSullivan',
  'screen_name': 'SenDanSullivan',
  'follower_count': 5569},
 {'name': 'David Perdue',
  'screen_name': 'sendavidperdue',
  'follower_count': 10900},
 {'name': 'Joni Ernst',
  'screen_name': 'SenJoniErnst',
  'follower_count': 17406},
 {'name': 'Senator Brian Schatz',
  'screen_name': 'SenBrianSchatz',
  'follower_count': 10548},
 {'name': 'Martin Heinrich',
  'screen_name': 'MartinHeinrich',
  'follower_count': 17135},
 {'name': 'Senator Jim Risch',
  'screen_name': 'SenatorRisch',
  'follower_count': 20859},
 {'name': 'Sen. Tammy Baldwin',
  'screen_name': 'SenatorBaldwin',
  'follower_count': 25711},
 {'name': 'Senator Ted Cruz',
  'screen_name': 'SenTedCruz',
  'follower_count': 83
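
If a plain list of numbers is all that is needed, a shorter version is possible. This is a sketch on made-up data, and it assumes that every tweet in a timeline reports the same `followers_count`, so the first tweet is representative. The cell above is what checks that assumption on the real data.

```python
# Hypothetical stand-in for timelines with only the fields used here.
toy_timelines = [
    [{"user": {"screen_name": "SenA", "followers_count": 120}},
     {"user": {"screen_name": "SenA", "followers_count": 120}}],
    [{"user": {"screen_name": "SenB", "followers_count": 45}}],
]

def follower_counts(timelines):
    """Return one follower count per non-empty timeline, read from its first tweet."""
    return [timeline[0]["user"]["followers_count"]
            for timeline in timelines if timeline]

counts = follower_counts(toy_timelines)
print(counts)
```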

### 4. What is the screen name of the senator with the largest number of followers?

In [6]:
# find the senator's screen name with most followers

max_entry = max(follower_count_list, key=lambda x: x['follower_count'])

max_screen_name = max_entry['screen_name']
max_followers = max_entry['follower_count']

print(max_screen_name, max_followers)
        

SenSanders 3333457


### 5. Make a list of lists where the outer list represents senators and the inner list contains the text of each senator's tweets, and call it tweets.



In [22]:
tweets = [[tweet['text'] for tweet in senator] for senator in timelines]
print(tweets)



### 6. Write a function, called remove_punct, that takes a word and returns the word with all punctuation characters removed, except for those that occur within a word.

In [30]:
import string
def remove_punct(word):
    # Keep apostrophes and hyphens, which commonly occur inside words
    punct_to_remove = set(string.punctuation) - {"'", "-"}
    cleaned_word = ''.join(char for char in word if char not in punct_to_remove)
    return cleaned_word
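
A stricter reading of the question removes punctuation only at the edges of a word and leaves everything inside it alone. The sketch below does this with `str.strip`. The name `strip_outer_punct` is made up for illustration, and this is an alternative to the cell above rather than a replacement for it.

```python
import string

def strip_outer_punct(word):
    """Remove leading and trailing punctuation; keep punctuation inside the word."""
    return word.strip(string.punctuation)

examples = [".@thomtillis", "cc:", "op-ed", "nation's", "'quoted'"]
stripped = [strip_outer_punct(w) for w in examples]
print(stripped)
```

Unlike the version above, this drops a leading or trailing apostrophe or hyphen, but keeps characters such as `.` or `@` when they sit between letters.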

### 7. Write a function that takes a tweet and returns a cleaned up version of the tweet. Here is an example function to get you started:
Note that the function I've provided is a bit buggy - it has some problems with some tweets. If your goal is to convert the tweet into a discrete set of words, what is going wrong here? Fix up and extend the example function.


In [104]:
def clean(tweet):
    words = [word.lower() for word in tweet.split() if 'http' not in word]
    if words[0] == 'rt':
        words = words[2:]
    return ' '.join(words)
clean(tweets[0][30])

'.@thomtillis took the ellie challenge and challenged other senators cc: @cure4ellie'
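
The output above hints at what goes wrong: punctuation stays glued to words (`.@thomtillis`, `cc:`), so the same word can appear in several forms. In addition, a tweet that consists only of links leaves `words` empty, and `words[0]` then raises an `IndexError`. One possible fix is sketched below. `clean_sketch` is a hypothetical name, the sample tweets are made up, and trimming punctuation with `str.strip` is an assumption about what "cleaned up" should mean.

```python
import string

def clean_sketch(tweet):
    """Lowercase a tweet, drop links and a leading retweet marker, trim punctuation."""
    words = [word.lower() for word in tweet.split() if 'http' not in word]
    # Guard against an empty list before looking at the first word.
    if words and words[0] == 'rt':
        words = words[2:]  # drop 'rt' and the retweeted handle, as in the example
    # Trim punctuation at the edges of each word and discard words that become empty.
    words = [word.strip(string.punctuation) for word in words]
    return ' '.join(word for word in words if word)

sample = "RT @someone: .@ThomTillis took the challenge cc: @Cure4Ellie http://t.co/x"
cleaned = clean_sketch(sample)
print(cleaned)
print(repr(clean_sketch("http://t.co/only-a-link")))
```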

### 8. Use the following file to create a list, called stopwords, that contains common English words:
http://www.textfixer.com/resources/common-english-words.txt. Make sure to pull the data into Python by writing Python code that downloads and reads it.

In [66]:
import requests

url = "http://www.textfixer.com/resources/common-english-words.txt"
response = requests.get(url)

# Stop with an HTTPError if the request failed, so stopwords is never left undefined
response.raise_for_status()

# The file is a single comma-separated line of words
stopwords = response.text.split(',')
print(stopwords)


['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'but', 'by', 'can', 'cannot', 'could', 'dear', 'did', 'do', 'does', 'either', 'else', 'ever', 'every', 'for', 'from', 'get', 'got', 'had', 'has', 'have', 'he', 'her', 'hers', 'him', 'his', 'how', 'however', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'least', 'let', 'like', 'likely', 'may', 'me', 'might', 'most', 'must', 'my', 'neither', 'no', 'nor', 'not', 'of', 'off', 'often', 'on', 'only', 'or', 'other', 'our', 'own', 'rather', 'said', 'say', 'says', 'she', 'should', 'since', 'so', 'some', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'this', 'tis', 'to', 'too', 'twas', 'us', 'wants', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'yet', 'you', 'your']


### 9. Write a function, called tokenize, which takes a tweet, cleans it, and removes all punctuation and stopwords.

In [115]:
import requests
import string

# Load stopwords into a set for faster membership testing
stopwords_url = "http://www.textfixer.com/resources/common-english-words.txt"
response = requests.get(stopwords_url)
stopwords = set(response.text.split(','))

# Punctuation to remove except for apostrophe and hyphen
punct_to_remove = set(string.punctuation) - {"'", "-"}

def tokenize(tweet):
    # Clean the tweet
    words = [word.lower() for word in tweet.split() if 'http' not in word]
    
    # Remove 'rt' prefix
    if words and words[0] == 'rt':
        words.pop(0)
    
    # Remove punctuation except for apostrophe and hyphen
    words = [''.join(char for char in word if char not in punct_to_remove) for word in words]
    
    # Remove stopwords using set-based membership testing
    cleaned_words = [word for word in words if word not in stopwords]
    
    return ' '.join(cleaned_words)

tokenize(tweets[0][30])

'heardonthehill thomtillis took ellie challenge challenged senators cc cure4ellie'

### 10. Create a list of lists, tweets_content, using your tokenize function.

In [117]:
tweets_content = [[tokenize(tweet['text']) for tweet in senator] for senator in timelines]
tweets_content

[['fayobserver holiday greetings local troops stationed around globe',
  'read op-ed fayobserver securing over 300 million hurricane matthew relief efforts',
  'senategop agscottpruitt leader help epa one overreaching federal agencies working americ…',
  'spent time 24thmeu during pre-deployment training aboard uss bataan amp uss mesa verde wishing each sailor amp m…',
  "craigderoche's op-ed criminal justice issues principles trumping partisanship via newsobserver",
  'provision help military kids autism milkids via fayobserver drewbrooks',
  'congrats appstatefb winning camelliabowl definethemoment',
  "honor greeting nation's troops way home holidays during usoofnc's operation exod…",
  'north carolinians affected hurricane matthew until jan 9 apply w fema federal disaster assistance',
  'johncornyn celebrating 225 years 🇺🇸',
  'joined sengillibrand secure autism care demonstration program military families ndaa…',
  "nickblue2016 i'm glad secure over 300m recovery assistance help n

### 11. Create a list, tokens, where all 200 of each senator's tweets are made into a single string. 
Hint: this syntax might be useful: " ".join(my_list_of_strings).

In [122]:
tokens = [" ".join(senator) for senator in tweets_content]
tokens

 "need choose foreign policy objectives cautious way-but choose mean amp allies… federalism nebraska amp vermont different places different people different values and… happy billofrightsday today celebrate government limited people's rights limitless onthisday 1799 george washington died legacy humble leadership still sets bar… proud work senbrianschatz bring 21st century solution city's 20th century approach data\u200b… draft daughters debate completely unnecessary ndaa victory common sense… talk work trade big topic it's much smaller topic ai automation amp transformati… remembering nebraskan victims pearlharbor special way today 75 years later never forget you… senatorfischer watch latest tribute fallen nehero sgt germaine debro omaha heritage america today 41 people under age 35 think 1st amendment dangerous horror- sensasse heri… american adult understand themselves part-time politician — especially those washington government create jobs - here's bears repeating obama administra

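
With one string per senator, questions about word use reduce to counting. The sketch below runs on two made-up strings (not the real `tokens`) and uses `collections.Counter`. The helper name `mention_count` is hypothetical, and counting whole-word matches of a single term is only one way a mention could be defined.

```python
from collections import Counter

# Hypothetical stand-in for tokens: one space-separated string per senator.
toy_tokens = [
    "obama plan jobs obama veterans",
    "budget vote jobs",
]

def mention_count(token_string, term):
    """Count how often term appears as a whole word in a space-separated string."""
    return Counter(token_string.split())[term]

counts = [mention_count(s, "obama") for s in toy_tokens]
print(counts)
```
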
In [None]:
### 12. 