# Data Creation Code
This notebook catalogs the code used to create our dataset. We use data on members of the United States Congress to generate many lines of text, which are then analyzed by Amazon Comprehend.

### Importing Packages

In [24]:
import boto3
import sagemaker
import json
import os
import pandas as pd
import numpy as np
client = boto3.client('comprehend')

### Importing the CSV file

As we can see below, our Congress dataset is available in our S3 bucket, `comprehendproject-qtm350`.

In [31]:
! aws s3 ls comprehendproject-qtm350

2023-04-04 17:01:28      22812 Congress.csv


Our dataset of Senate members has been uploaded to an S3 bucket, so we can read it directly using `s3fs`.

In [25]:
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('s3://comprehendproject-qtm350/Congress.csv', 'rb') as f:
    df = pd.read_csv(f) # convert the csv file to a pandas dataframe
df.head() # preview the dataset

Unnamed: 0,name,sort_name,email,twitter,facebook,group,group_id,area_id,area,chamber,term,start_date,end_date,image,gender,wikidata,wikidata_group,wikidata_area
0,Amy Klobuchar,"Klobuchar, Amy",,SenAmyKlobuchar,,Democrat,democrat,ocd-division/country:us/state:mn,Minnesota,Senate,116,,,https://theunitedstates.io/images/congress/ori...,female,Q22237,Q29552,Q1527
1,"Angus S. King, Jr.","King, Angus",,SenAngusKing,SenatorAngusSKingJr,Independent,independent,ocd-division/country:us/state:me,Maine,Senate,116,,,https://theunitedstates.io/images/congress/ori...,male,Q544464,Q327591,Q724
2,Ben Sasse,"Sasse, Benjamin",,SenSasse,SenatorSasse,Republican,republican,ocd-division/country:us/state:ne,Nebraska,Senate,116,,,https://theunitedstates.io/images/congress/ori...,male,Q16192221,Q29468,Q1553
3,Benjamin L. Cardin,"Cardin, Benjamin",,SenatorCardin,senatorbencardin,Democrat,democrat,ocd-division/country:us/state:md,Maryland,Senate,116,,,https://theunitedstates.io/images/congress/ori...,male,Q723295,Q29552,Q1391
4,Bernard Sanders,"Sanders, Bernard",,SenSanders,senatorsanders,Independent,independent,ocd-division/country:us/state:vt,Vermont,Senate,116,,,https://theunitedstates.io/images/congress/ori...,male,Q359442,Q327591,Q16551


### Cleaning the data

Let's filter our data to just our needed columns.

In [26]:
senators = df[["name","group","gender"]].copy() # copy so later assignments do not act on a slice of df
senators.head()

Unnamed: 0,name,group,gender
0,Amy Klobuchar,Democrat,female
1,"Angus S. King, Jr.",Independent,male
2,Ben Sasse,Republican,male
3,Benjamin L. Cardin,Democrat,male
4,Bernard Sanders,Independent,male


Let's convert our categorical variables of political party and gender to binary variables. This will make it easier to use our data for regression analysis.

We code "group" as a binary variable, `Democrat`, where Democratic senators are assigned a 1 and Republican senators are assigned a 0. Independent senators receive no value and are removed by `dropna`.
We also code "gender" as a binary variable, `female`, where female senators are assigned a 1 and male senators are assigned a 0.

In [27]:
senators.loc[senators["group"] == "Democrat", 'Democrat'] = 1
senators.loc[senators["group"] == "Republican", 'Democrat'] = 0
senators.loc[senators["gender"] == "female", 'female'] = 1
senators.loc[senators["gender"] != "female", 'female'] = 0
senators.dropna(inplace=True)
senators = senators.reset_index(drop=True)
senators.head()


Unnamed: 0,name,group,gender,Democrat,female
0,Amy Klobuchar,Democrat,female,1.0,1.0
1,Ben Sasse,Republican,male,0.0,0.0
2,Benjamin L. Cardin,Democrat,male,1.0,0.0
3,Bill Cassidy,Republican,male,0.0,0.0
4,Brian Schatz,Democrat,male,1.0,0.0
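
The same recoding can be sketched with `Series.map` on a small frame. The three rows below are made up for illustration and are not read from `Congress.csv`. Parties absent from the mapping, such as Independent, become `NaN`, which is why `dropna` removes those senators.

```python
import pandas as pd

# Illustrative rows only; not read from Congress.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "group": ["Democrat", "Republican", "Independent"],
    "gender": ["female", "male", "male"],
})

# Parties missing from the mapping (e.g. Independent) map to NaN
toy["Democrat"] = toy["group"].map({"Democrat": 1.0, "Republican": 0.0})
# A boolean comparison converts directly to a 0/1 indicator
toy["female"] = (toy["gender"] == "female").astype(float)

# Dropping NaN rows removes anyone who is neither Democrat nor Republican
toy = toy.dropna().reset_index(drop=True)
print(toy)
```

Assigning whole columns this way also avoids row-by-row `.loc` assignments, though either approach should give the same indicator columns.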


### Generating Sentiment Scores for Senators

Now that our data is cleaned, we can define a function that runs a sentiment test for a given entity. The entity is prepended to the body of the text to form a single string, which the function then passes to Comprehend in an API call. Depending on the option specified, the function returns one of the following outputs:

1. The overall sentiment of the text (POSITIVE, NEGATIVE, MIXED, or NEUTRAL)
2. The positive sentiment score (0 to 1)
3. The negative sentiment score (0 to 1)
4. The mixed sentiment score (0 to 1)
5. The neutral sentiment score (0 to 1)

In [28]:
# Defining our sentiment function
def sentiment_test(body, entity, option="overall"):
    """Return the sentiment label or one sentiment score for '<entity> <body>'."""
    text = f"{entity} {body}"
    respond = 0  # returned unchanged if `option` is not recognized

    # Send the single combined string to Comprehend
    response = client.batch_detect_sentiment(
        TextList=[text],
        LanguageCode='en')

    result = response['ResultList'][0]  # only one text was submitted
    if option == "overall":
        respond = result['Sentiment']
    elif option == "positive":
        respond = result['SentimentScore']['Positive']
    elif option == "negative":
        respond = result['SentimentScore']['Negative']
    elif option == "mixed":
        respond = result['SentimentScore']['Mixed']
    elif option == "neutral":
        respond = result['SentimentScore']['Neutral']
    return respond
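
Each response already carries the overall label and all four scores, so a single response can be flattened in one pass. The sketch below assumes a hypothetical helper, `parse_sentiment`, and a hand-written `sample_response` that mirrors the shape returned by `batch_detect_sentiment`; the score values are made up.

```python
def parse_sentiment(response):
    """Flatten the first result of a batch_detect_sentiment response."""
    result = response["ResultList"][0]
    scores = result["SentimentScore"]
    return {
        "overall": result["Sentiment"],
        "positive": scores["Positive"],
        "negative": scores["Negative"],
        "mixed": scores["Mixed"],
        "neutral": scores["Neutral"],
    }


# Hand-written stand-in for a real API response (values are made up)
sample_response = {
    "ResultList": [{
        "Index": 0,
        "Sentiment": "NEUTRAL",
        "SentimentScore": {"Positive": 0.01, "Negative": 0.07,
                           "Mixed": 0.02, "Neutral": 0.90},
    }],
    "ErrorList": [],
}

parsed = parse_sentiment(sample_response)
print(parsed["overall"], parsed["neutral"])
```

The keys of the returned dictionary match the `option` values accepted by `sentiment_test`.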

We will now define the phrases to send to Comprehend. Each phrase will be paired with every entity and tested for its sentiment scores.

In [29]:
phrases = ["claims they want to protect children from the dangers of social media.", # neutral
           "introduced new gun control legislation to Congress.", # neutral
           "is a really great politican", # positive
           "donated millions of dollars to charity!", # positive
           "want to take away human rights.", # negative
           "commited tax fraud", # negative
          ""]

Now that our function and phrases are defined, we can repeat this API call for every senator in our dataset.
We will do this by filling empty lists with the scores for each of the sentiment categories.

In [30]:
overall = []
positive = []
negative = []
mixed = []
neutral = []
text = []
names = []                
for phrase in phrases:
    for i in range(len(senators)):
        names.append(senators["name"][i])
        overall.append(sentiment_test(body = phrase, entity = str(senators["name"][i]), 
                                      option="overall"))
        positive.append(sentiment_test(body = phrase, entity = str(senators["name"][i]), option="positive"))
        negative.append(sentiment_test(body = phrase, entity = str(senators["name"][i]), option="negative"))
        mixed.append(sentiment_test(body = phrase, entity = str(senators["name"][i]), option="mixed"))
        neutral.append(sentiment_test(body = phrase, entity = str(senators["name"][i]), option="neutral"))
        text.append(phrase)
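
The loop above makes five API calls for every name and phrase pair, one per option. Because `batch_detect_sentiment` accepts a list of texts (up to 25 per request), the same values could be collected with far fewer requests. The following is only a sketch of that idea: `score_texts` and `FakeComprehend` are hypothetical names, and the fake client returns fixed scores so the example runs without AWS credentials.

```python
def score_texts(client, texts, batch_size=25):
    """Return one row of sentiment values per text, batching the API calls."""
    rows = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.batch_detect_sentiment(TextList=batch, LanguageCode="en")
        for result in response["ResultList"]:
            s = result["SentimentScore"]
            # "Index" is the position of the text within this batch
            rows[start + result["Index"]] = {
                "Text": batch[result["Index"]],
                "Overall": result["Sentiment"],
                "Positive": s["Positive"], "Negative": s["Negative"],
                "Mixed": s["Mixed"], "Neutral": s["Neutral"],
            }
    return rows


class FakeComprehend:
    """Hypothetical stand-in for the boto3 client; returns fixed scores."""
    def __init__(self):
        self.calls = 0

    def batch_detect_sentiment(self, TextList, LanguageCode):
        self.calls += 1
        return {"ResultList": [
            {"Index": i, "Sentiment": "NEUTRAL",
             "SentimentScore": {"Positive": 0.0, "Negative": 0.0,
                                "Mixed": 0.0, "Neutral": 1.0}}
            for i in range(len(TextList))], "ErrorList": []}


fake = FakeComprehend()
texts = [f"Name{i} donated millions of dollars to charity!" for i in range(60)]
rows = score_texts(fake, texts)
print(len(rows), fake.calls)
```

With the real service, `client` would be the `boto3.client('comprehend')` object created earlier. Any text reported in `ErrorList` would be left as `None` in `rows` and would need separate handling.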

We can then combine these lists into a single dataframe called `scores` and use `pd.merge` to join the scores with each senator's attributes on the `name` column.

In [31]:
scores = pd.DataFrame({"name": names, "Text": text, "Overall": overall,"Positive":positive,"Negative":negative,"Mixed":mixed,"Neutral":neutral})
scores.head()

Unnamed: 0,name,Text,Overall,Positive,Negative,Mixed,Neutral
0,Amy Klobuchar,claims they want to protect children from the ...,NEUTRAL,0.013765,0.07438,0.004444,0.90741
1,Ben Sasse,claims they want to protect children from the ...,NEUTRAL,0.016394,0.043721,0.008574,0.93131
2,Benjamin L. Cardin,claims they want to protect children from the ...,NEUTRAL,0.012917,0.022653,0.000486,0.963944
3,Bill Cassidy,claims they want to protect children from the ...,NEUTRAL,0.010774,0.052979,0.002587,0.93366
4,Brian Schatz,claims they want to protect children from the ...,NEUTRAL,0.019647,0.020666,0.001563,0.958124


In [32]:
combined = pd.merge(scores, senators, on='name')
combined.head()

Unnamed: 0,name,Text,Overall,Positive,Negative,Mixed,Neutral,group,gender,Democrat,female
0,Amy Klobuchar,claims they want to protect children from the ...,NEUTRAL,0.013765,0.07438,0.004444,0.90741,Democrat,female,1.0,1.0
1,Amy Klobuchar,introduced new gun control legislation to Cong...,NEUTRAL,0.004816,0.001847,0.000103,0.993234,Democrat,female,1.0,1.0
2,Amy Klobuchar,is a really great politican,POSITIVE,0.979296,0.001517,0.003056,0.016131,Democrat,female,1.0,1.0
3,Amy Klobuchar,donated millions of dollars to charity!,NEUTRAL,0.105815,0.003947,0.000465,0.889773,Democrat,female,1.0,1.0
4,Amy Klobuchar,want to take away human rights.,NEUTRAL,0.005937,0.488358,0.000868,0.504837,Democrat,female,1.0,1.0


In [33]:
combined

Unnamed: 0,name,Text,Overall,Positive,Negative,Mixed,Neutral,group,gender,Democrat,female
0,Amy Klobuchar,claims they want to protect children from the ...,NEUTRAL,0.013765,0.074380,0.004444,0.907410,Democrat,female,1.0,1.0
1,Amy Klobuchar,introduced new gun control legislation to Cong...,NEUTRAL,0.004816,0.001847,0.000103,0.993234,Democrat,female,1.0,1.0
2,Amy Klobuchar,is a really great politican,POSITIVE,0.979296,0.001517,0.003056,0.016131,Democrat,female,1.0,1.0
3,Amy Klobuchar,donated millions of dollars to charity!,NEUTRAL,0.105815,0.003947,0.000465,0.889773,Democrat,female,1.0,1.0
4,Amy Klobuchar,want to take away human rights.,NEUTRAL,0.005937,0.488358,0.000868,0.504837,Democrat,female,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
681,Tom Udall,is a really great politican,POSITIVE,0.981198,0.000609,0.003153,0.015040,Democrat,male,1.0,0.0
682,Tom Udall,donated millions of dollars to charity!,NEUTRAL,0.197778,0.007483,0.003950,0.790789,Democrat,male,1.0,0.0
683,Tom Udall,want to take away human rights.,NEUTRAL,0.008663,0.368030,0.000374,0.622933,Democrat,male,1.0,0.0
684,Tom Udall,commited tax fraud,NEGATIVE,0.001361,0.654791,0.001282,0.342566,Democrat,male,1.0,0.0


Finally, we can write the result to a CSV file and upload it to our GitHub repository.

In [34]:
combined.to_csv('senators_sentiment_fixed.csv', index=False)

### Generating Sentiment Scores for Random Names

First, we will import our random names from our S3 bucket. This file contains a random selection of names and their corresponding gender, drawn from a public dataset created by the Social Security Administration. That dataset contains all names from Social Security card applications for births that occurred in the United States after 1879.

In [35]:
with fs.open('s3://comprehendproject-qtm350/random names.csv', 'rb') as f:
    random = pd.read_csv(f) # convert the csv file to a pandas dataframe
random.head() # preview the dataset

Unnamed: 0,name,gender
0,Gregory,M
1,Tyson,M
2,Arrianna,F
3,Hilda,F
4,Sharonda,F


We will need to recode the gender variable as a binary variable.
As in the senators dataframe, female is coded as 1 and male is coded as 0.

In [36]:
random.loc[random["gender"] == "F", 'female'] = 1
random.loc[random["gender"] != "F", 'female'] = 0
random.head()

Unnamed: 0,name,gender,female
0,Gregory,M,0.0
1,Tyson,M,0.0
2,Arrianna,F,1.0
3,Hilda,F,1.0
4,Sharonda,F,1.0


In [37]:
overall = []
positive = []
negative = []
mixed = []
neutral = []
text = []
names = []                
for phrase in phrases:
    for i in range(len(random)):
        names.append(random["name"][i])
        overall.append(sentiment_test(body = phrase, entity = str(random["name"][i]), 
                                      option="overall"))
        positive.append(sentiment_test(body = phrase, entity = str(random["name"][i]), option="positive"))
        negative.append(sentiment_test(body = phrase, entity = str(random["name"][i]), option="negative"))
        mixed.append(sentiment_test(body = phrase, entity = str(random["name"][i]), option="mixed"))
        neutral.append(sentiment_test(body = phrase, entity = str(random["name"][i]), option="neutral"))
        text.append(phrase)

We can then combine these lists into a single dataframe called `random_scores` and use `pd.merge` to join the scores with gender on the `name` column.

In [38]:
random_scores = pd.DataFrame({"name": names, "Text": text, "Overall": overall,"Positive":positive,"Negative":negative,"Mixed":mixed,"Neutral":neutral})
random_combined = pd.merge(random_scores, random, on='name')
random_combined.head()

Unnamed: 0,name,Text,Overall,Positive,Negative,Mixed,Neutral,gender,female
0,Gregory,claims they want to protect children from the ...,NEUTRAL,0.018272,0.041988,0.004749,0.934991,M,0.0
1,Gregory,introduced new gun control legislation to Cong...,NEUTRAL,0.015322,0.00405,0.000487,0.980141,M,0.0
2,Gregory,is a really great politican,POSITIVE,0.97007,0.001349,0.008348,0.020234,M,0.0
3,Gregory,donated millions of dollars to charity!,NEUTRAL,0.217226,0.011766,0.012721,0.758287,M,0.0
4,Gregory,want to take away human rights.,NEUTRAL,0.010364,0.430839,0.002182,0.556615,M,0.0
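
Names in a random sample may repeat, and a repeated name in the lookup table would duplicate rows when merging on `name`. As a hedged sketch, the `validate` argument of `pd.merge` can guard against this. The frames below are made up and only stand in for `random_scores` and `random`.

```python
import pandas as pd
from pandas.errors import MergeError

# Made-up frames standing in for random_scores and random
toy_scores = pd.DataFrame({"name": ["Hilda", "Hilda", "Tyson"],
                           "Positive": [0.1, 0.9, 0.2]})
toy_names = pd.DataFrame({"name": ["Hilda", "Tyson"], "gender": ["F", "M"]})

# many_to_one: a name may appear many times on the left but only once on the right
merged = pd.merge(toy_scores, toy_names, on="name", validate="many_to_one")

# A repeated name on the right side makes the check fail
dup_names = pd.concat([toy_names, toy_names.iloc[[0]]], ignore_index=True)
try:
    pd.merge(toy_scores, dup_names, on="name", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

If the check fails on the real data, dropping duplicate names from the lookup table before merging is one possible remedy.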


Finally, we can write the result to a CSV file and upload it to our GitHub repository.

In [39]:
random_combined.to_csv('random_names_scores.csv', index=False)