# Data Cleaning

`Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out".`

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data 
2. Cleaning the data 
3. Organizing the data - organize the cleaned data into a way that is easy to input into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

## Problem Statement

The goal is to analyze biographical information about various politicians and identify similarities and differences in their backgrounds, career trajectories, and key attributes. In addition, we want to determine whether a selected politician has a distinctive style or approach compared to the others.

## Getting The Data

You can get the biographies of politicians from [Biography online](https://www.biographyonline.net/politicians).



In [1]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes biography text from biographyonline.net
def url_to_transcript(url):
    '''Returns the biography text from a biographyonline.net page.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content")]
    print(url)
    return text

# URLs of the biographies in scope
urls = ['https://www.biographyonline.net/abraham-lincoln.html',
        'https://www.biographyonline.net/politicians/winston_churchill.html',
        'https://www.biographyonline.net/business/donald-trump.html',
        'https://www.biographyonline.net/politicians/uk/boris-johnson.html',
        'https://www.biographyonline.net/politicians/asia/b-r-ambedkar-biography.html',
        'https://www.biographyonline.net/politicians/nelson-mandela.html',
        'https://www.biographyonline.net/politicians/uk/margaret-thatcher.html',
        'https://www.biographyonline.net/politicians/american/george-washington.html',
        'https://www.biographyonline.net/politicians/american/franklin-roosevelt.html',
        'https://www.biographyonline.net/spiritual/desmond-tutu.html/',
        'https://www.biographyonline.net/politicians/europe/willy-brandt.html',
        'https://www.biographyonline.net/politicians/indian/gandhi.html']

# Politician names
politicians = ['abraham', 'winston', 'trump', 'boris', 'ambedkar', 'mandela',
             'margaret', 'washington', 'roosevelt', 'desmond', 'brandt', 'gandhi']

In [2]:
# Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

https://www.biographyonline.net/abraham-lincoln.html
https://www.biographyonline.net/politicians/winston_churchill.html
https://www.biographyonline.net/business/donald-trump.html
https://www.biographyonline.net/politicians/uk/boris-johnson.html
https://www.biographyonline.net/politicians/asia/b-r-ambedkar-biography.html
https://www.biographyonline.net/politicians/nelson-mandela.html
https://www.biographyonline.net/politicians/uk/margaret-thatcher.html
https://www.biographyonline.net/politicians/american/george-washington.html
https://www.biographyonline.net/politicians/american/franklin-roosevelt.html
https://www.biographyonline.net/spiritual/desmond-tutu.html/
https://www.biographyonline.net/politicians/europe/willy-brandt.html
https://www.biographyonline.net/politicians/indian/gandhi.html


In [3]:
# Pickle files for later use

# Make a new directory to hold the text files
!mkdir transcripts

for i, c in enumerate(politicians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

A subdirectory or file transcripts already exists.


In [4]:
# Load pickled files
data = {}
for i, c in enumerate(politicians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
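
The save-and-load round trip above can also be written without the shell `mkdir` call, which avoids the "already exists" message on re-runs. This is a minimal sketch, assuming a hypothetical `demo_transcripts` folder inside a temporary directory and two made-up entries in place of the scraped data:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the scraped data: name -> list of paragraphs
sample = {
    "alice": ["First paragraph.", "Second paragraph."],
    "bob": ["Only paragraph."],
}

# A temporary directory keeps the demo away from the real transcripts folder
out_dir = Path(tempfile.mkdtemp()) / "demo_transcripts"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Save each entry to its own pickle file
for name, paragraphs in sample.items():
    with open(out_dir / f"{name}.pkl", "wb") as f:
        pickle.dump(paragraphs, f)

# Load everything back into a dictionary
loaded = {}
for name in sample:
    with open(out_dir / f"{name}.pkl", "rb") as f:
        loaded[name] = pickle.load(f)

print(loaded == sample)  # the round trip preserves the data
```

The `.pkl` extension used here is only a convention that signals the files are binary pickles rather than plain text.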

In [5]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['abraham', 'winston', 'trump', 'boris', 'ambedkar', 'mandela', 'margaret', 'washington', 'roosevelt', 'desmond', 'brandt', 'gandhi'])

In [6]:
# More checks
data['gandhi'][:2]

['Mahatma Gandhi was a prominent Indian political leader who was a leading figure in the campaign for Indian independence. He employed non-violent principles and peaceful disobedience as a means to achieve his goal. He was assassinated in 1948, shortly after achieving his life goal of Indian independence. In India, he is known as ‘Father of the Nation’.',
 '“When I despair, I remember that all through history the ways of truth and love have always won. There have been tyrants, and murderers, and for a time they can seem invincible, but in the end they always fall. Think of it–always.”']

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate.

### Assignment:
1. Perform the following data cleaning on the transcripts:
   i) Make all text lower case
   ii) Remove punctuation
   iii) Remove numerical values
   iv) Remove common nonsensical text (\n)
   v) Tokenize text
   vi) Remove stop words
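
The six steps listed above can be sketched end to end on a single sentence. This is only an illustration: the `clean_and_tokenize` helper, the sample sentence, and the tiny stop-word list are all assumptions made for the example, and real stop-word lists are much longer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "in", "was", "of", "and"}

def clean_and_tokenize(text):
    """Apply the six cleaning steps to one string and return a list of tokens."""
    text = text.lower()                                               # i) lower case
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # ii) punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # iii) words containing digits
    text = text.replace("\n", " ")                                    # iv) newline characters
    tokens = text.split()                                             # v) tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]                 # vi) stop words

sample = "Abraham Lincoln was born in 1809,\nin a log cabin."
print(clean_and_tokenize(sample))
# ['abraham', 'lincoln', 'born', 'log', 'cabin']
```

In the notebook itself, steps v and vi are handled later by `CountVectorizer` rather than by hand.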

In [7]:
# Let's take a look at our data again
next(iter(data.keys()))

'abraham'

In [8]:
# Notice that our dictionary is currently in key: politician, value: list of text format
next(iter(data.values()))

['“With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work we are in; to bind up the nation’s wounds…. ”',
 '– Abraham Lincoln',
 'Abraham Lincoln was born Feb 12, 1809, in a single-room log cabin, Hardin County, Kentucky. His family upbringing was modest; his parents from Virginia were neither wealthy or well known. At an early age, the young Abraham lost his mother, and his father moved away to Indiana. Abraham had to work hard splitting logs and other manual labour. But, he also had a thirst for knowledge and worked very hard to excel in his studies. This led him to become self-trained as a lawyer. He spent eight years working on the Illinois court circuit; his ambition, drive, and capacity for hard work were evident to all around him. Lincoln became respected on the legal circuit and he gained the nickname ‘Honest Abe.’ He often encouraged neighbours to mediate their own conflicts rather than p

In [9]:
# We are going to change this to key: politician, value: string format
def combine_biographies(list_of_biographies):
    '''Takes a list of biographies and combines them into one large chunk of text.'''
    combined_biographies = ' '.join(list_of_biographies)
    return combined_biographies


In [10]:
# Combine it!
data_combined = {key: [combine_biographies(value)] for (key, value) in data.items()}

In [11]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
abraham,"“With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work ..."
ambedkar,Dr B.R. Ambedkar (1891 – 1956) \n\n\n\n\n\n\n\nB.R. Ambedkar “Babasaheb” was an Indian political reformer who campaigned for the rights of the ‘un...
boris,"Boris Johnson is a leading Conservative politician and British Prime Minister, who was elected leader of the Conservative Party in the summer of 2..."
brandt,Willy Brandt (1913-1992) – German statesman and politician. Willy Brandt was a German left-wing politician who fled Nazi persecution in 1933. Afte...
desmond,"\n\n\n\n\n\n\n\nDesmond Mpilo Tutu (1931 – 2021) was born in Klerksdorp, Transvaal 7 October 1931 in South Africa. As a vocal and committed oppone..."
gandhi,Mahatma Gandhi was a prominent Indian political leader who was a leading figure in the campaign for Indian independence. He employed non-violent p...
mandela,\n \n\n\n\nNelson Mandela (1918 – 2013) was a South African political activist who spent over 20 years in prison for his opposition to the aparthe...
margaret,"\n\n\n\n\n\n\n\nMargaret Thatcher (1925-2013) was Britain’s first female prime minister (1979-90). She was known for her tough uncompromising, con..."
roosevelt,"\n\n\n\n\n\n\n\nFranklin Delano Roosevelt (January 30, 1882 – April 12, 1945), often referred to by his initials FDR, was the thirty-second Presid..."
trump,"Donald Trump (1946 – ) is the 45th President of the US. For many years he was chairman and president of the Trump Organisation, which has a divers..."


In [12]:
# Let's take a look at the transcript for Boris Johnson
data_df.transcript.loc['boris']

'Boris Johnson is a leading Conservative politician and British Prime Minister, who was elected leader of the Conservative Party in the summer of 2019, in a bid to take the UK out of the EU with or without a deal. He served as Mayor of London for two terms 2008-16, overseeing the 2012 London Olympics. He also played a leading role in the 2016 “Vote Leave” campaign on the EU referendum, afterwards becoming Foreign Secretary and later Prime Minister. He is one of Britain’s most high profile politicians, renowned for his eccentric approach to life but increasingly known for his hardline Brexit stance which has polarised opinion. Johnson’s term has Prime Minister was overshadowed by the coronavirus crisis. In 2021/22, details emerged that unauthorised parties had taken place in number 10 Downing Street – when the rest of the country was in lockdown. Early life of Boris Johnson Boris Johnson was born on 19th June 1964. His full name is Alexander Boris de Pfeffel Johnson but chooses to use t

In [13]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [14]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean  

Unnamed: 0,transcript
abraham,“with malice toward none with charity for all with firmness in the right as god gives us to see the right let us strive on to finish the work we a...
ambedkar,dr br ambedkar – \n\n\n\n\n\n\n\nbr ambedkar “babasaheb” was an indian political reformer who campaigned for the rights of the ‘untouchable’ cas...
boris,boris johnson is a leading conservative politician and british prime minister who was elected leader of the conservative party in the summer of i...
brandt,willy brandt – german statesman and politician willy brandt was a german leftwing politician who fled nazi persecution in after the second world...
desmond,\n\n\n\n\n\n\n\ndesmond mpilo tutu – was born in klerksdorp transvaal october in south africa as a vocal and committed opponent of apartheid i...
gandhi,mahatma gandhi was a prominent indian political leader who was a leading figure in the campaign for indian independence he employed nonviolent pri...
mandela,\n \n\n\n\nnelson mandela – was a south african political activist who spent over years in prison for his opposition to the apartheid regime he...
margaret,\n\n\n\n\n\n\n\nmargaret thatcher was britain’s first female prime minister she was known for her tough uncompromising conservative political vi...
roosevelt,\n\n\n\n\n\n\n\nfranklin delano roosevelt january – april often referred to by his initials fdr was the thirtysecond president of the united s...
trump,donald trump – is the president of the us for many years he was chairman and president of the trump organisation which has a diverse range of b...


In [15]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
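
Round 2 is needed because Python's `string.punctuation` covers only ASCII punctuation, so the typographic quotes and ellipsis in the scraped text pass through round 1 untouched. A quick check, using a made-up sample string:

```python
import re
import string

# Typographic characters that appear in the scraped biographies
curly = "‘’“”…–"

# None of them are part of the ASCII-only string.punctuation
for ch in curly:
    print(repr(ch), ch in string.punctuation)

# The round 2 character class removes the quotes and the ellipsis
text = "britain’s “iron lady”…"
cleaned = re.sub("[‘’“”…]", "", text)
print(cleaned)  # britains iron lady
```

The en dash (–) is also outside `string.punctuation` and is not in the round 2 character class, which is why it is still visible in the output of the next cell.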

In [16]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
abraham,with malice toward none with charity for all with firmness in the right as god gives us to see the right let us strive on to finish the work we ar...
ambedkar,dr br ambedkar – br ambedkar babasaheb was an indian political reformer who campaigned for the rights of the untouchable caste of india he playe...
boris,boris johnson is a leading conservative politician and british prime minister who was elected leader of the conservative party in the summer of i...
brandt,willy brandt – german statesman and politician willy brandt was a german leftwing politician who fled nazi persecution in after the second world...
desmond,desmond mpilo tutu – was born in klerksdorp transvaal october in south africa as a vocal and committed opponent of apartheid in south africa t...
gandhi,mahatma gandhi was a prominent indian political leader who was a leading figure in the campaign for indian independence he employed nonviolent pri...
mandela,nelson mandela – was a south african political activist who spent over years in prison for his opposition to the apartheid regime he was relea...
margaret,margaret thatcher was britains first female prime minister she was known for her tough uncompromising conservative political views and became du...
roosevelt,franklin delano roosevelt january – april often referred to by his initials fdr was the thirtysecond president of the united states he served ...
trump,donald trump – is the president of the us for many years he was chairman and president of the trump organisation which has a diverse range of b...


## Organizing The Data

### Assignment:
1. Organize the data into two standard text formats:
   a) Corpus - a collection of texts, stored here in a pandas dataframe
   b) Document-Term Matrix - word counts in matrix format

### Corpus: Example

A corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [17]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
abraham,"“With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work ..."
ambedkar,Dr B.R. Ambedkar (1891 – 1956) \n\n\n\n\n\n\n\nB.R. Ambedkar “Babasaheb” was an Indian political reformer who campaigned for the rights of the ‘un...
boris,"Boris Johnson is a leading Conservative politician and British Prime Minister, who was elected leader of the Conservative Party in the summer of 2..."
brandt,Willy Brandt (1913-1992) – German statesman and politician. Willy Brandt was a German left-wing politician who fled Nazi persecution in 1933. Afte...
desmond,"\n\n\n\n\n\n\n\nDesmond Mpilo Tutu (1931 – 2021) was born in Klerksdorp, Transvaal 7 October 1931 in South Africa. As a vocal and committed oppone..."
gandhi,Mahatma Gandhi was a prominent Indian political leader who was a leading figure in the campaign for Indian independence. He employed non-violent p...
mandela,\n \n\n\n\nNelson Mandela (1918 – 2013) was a South African political activist who spent over 20 years in prison for his opposition to the aparthe...
margaret,"\n\n\n\n\n\n\n\nMargaret Thatcher (1925-2013) was Britain’s first female prime minister (1979-90). She was known for her tough uncompromising, con..."
roosevelt,"\n\n\n\n\n\n\n\nFranklin Delano Roosevelt (January 30, 1882 – April 12, 1945), often referred to by his initials FDR, was the thirty-second Presid..."
trump,"Donald Trump (1946 – ) is the 45th President of the US. For many years he was chairman and president of the Trump Organisation, which has a divers..."


In [18]:
# Let's add the politicians' full names as well
full_names = ['Abraham Lincoln', 'BR Ambedkar', 'Boris Johnson', 'Willy Brandt', 'Desmond Tutu', 'Mahatma Gandhi',
             'Nelson Mandela', 'Margaret Thatcher', 'Franklin Roosevelt', 'Donald Trump', 'George Washington', 'Winston Churchill']


data_df['full_name'] = full_names
data_df

Unnamed: 0,transcript,full_name
abraham,"“With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work ...",Abraham Lincoln
ambedkar,Dr B.R. Ambedkar (1891 – 1956) \n\n\n\n\n\n\n\nB.R. Ambedkar “Babasaheb” was an Indian political reformer who campaigned for the rights of the ‘un...,BR Ambedkar
boris,"Boris Johnson is a leading Conservative politician and British Prime Minister, who was elected leader of the Conservative Party in the summer of 2...",Boris Johnson
brandt,Willy Brandt (1913-1992) – German statesman and politician. Willy Brandt was a German left-wing politician who fled Nazi persecution in 1933. Afte...,Willy Brandt
desmond,"\n\n\n\n\n\n\n\nDesmond Mpilo Tutu (1931 – 2021) was born in Klerksdorp, Transvaal 7 October 1931 in South Africa. As a vocal and committed oppone...",Desmond Tutu
gandhi,Mahatma Gandhi was a prominent Indian political leader who was a leading figure in the campaign for Indian independence. He employed non-violent p...,Mahatma Gandhi
mandela,\n \n\n\n\nNelson Mandela (1918 – 2013) was a South African political activist who spent over 20 years in prison for his opposition to the aparthe...,Nelson Mandela
margaret,"\n\n\n\n\n\n\n\nMargaret Thatcher (1925-2013) was Britain’s first female prime minister (1979-90). She was known for her tough uncompromising, con...",Margaret Thatcher
roosevelt,"\n\n\n\n\n\n\n\nFranklin Delano Roosevelt (January 30, 1882 – April 12, 1945), often referred to by his initials FDR, was the thirty-second Presid...",Franklin Roosevelt
trump,"Donald Trump (1946 – ) is the 45th President of the US. For many years he was chairman and president of the Trump Organisation, which has a divers...",Donald Trump


In [19]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix: Example

For many of the techniques we'll be using in future assignments, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's `CountVectorizer`, where every row represents a different document and every column represents a different word.

In addition, `CountVectorizer` can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to the text.
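
To make the row and column layout concrete, here is a hand-rolled sketch of a document-term matrix built with the standard library only. It is not how `CountVectorizer` is implemented; the two documents and the two-word stop list are made up for illustration.

```python
from collections import Counter

# Two made-up documents standing in for cleaned biographies
docs = {
    "doc_a": "the leader led the nation",
    "doc_b": "the nation elected a leader",
}
stop_words = {"the", "a"}  # tiny illustrative list, not scikit-learn's

# Count the non-stop-word tokens in each document
counts = {
    name: Counter(w for w in text.split() if w not in stop_words)
    for name, text in docs.items()
}

# The vocabulary is the sorted set of all remaining words (one column each)
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word
matrix = [[counts[name][word] for word in vocab] for name in docs]

print(vocab)  # ['elected', 'leader', 'led', 'nation']
for name, row in zip(docs, matrix):
    print(name, row)
# doc_a [0, 1, 1, 1]
# doc_b [1, 1, 0, 1]
```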

In [20]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,abandoning,abe,abhorred,ability,abitur,able,ably,abolish,abolitionists,abomination,...,young,youngster,youre,yousafzai,youth,zeal,zealand,zelnickova,zone,zulu
abraham,0,1,1,1,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
ambedkar,0,0,0,0,0,0,1,0,0,0,...,2,0,0,0,0,0,0,0,0,0
boris,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
brandt,0,0,0,0,1,2,0,0,0,0,...,2,0,0,0,1,1,0,0,0,0
desmond,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
gandhi,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,1
mandela,0,0,0,1,0,1,0,0,0,0,...,1,1,0,1,0,0,1,0,0,0
margaret,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
roosevelt,0,0,0,0,0,2,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
trump,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


In [21]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [22]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

## Additional Assignments:

1. Can you add an additional regular expression to the clean_text_round2 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?

In [23]:
def round3(text):
    '''Apply a third round of cleaning: titles, very short or long words, numbers, URLs and leftover symbols.'''
    text = re.sub('[‘’“”…]', '', text)  # Remove additional punctuation
    text = re.sub('\n', ' ', text)  # Replace newline characters with a space
    text = re.sub(r'\b(hon|dr|mr|mrs|prof|miss|ms)\b', '', text)  # Remove common titles
    text = re.sub(r'\b\w{1,2}\b', '', text)  # Remove short words (1 or 2 characters)
    text = re.sub(r'\b\w{20,}\b', '', text)  # Remove very long words (20 or more characters)
    text = re.sub(r'\b\w*([a-zA-Z])\1{2,}\w*\b', '', text)  # Remove words with a letter repeated 3+ times in a row (e.g., "soooo", "funnyyyy")
    #text = re.sub(r'\b(president|senator|congressman|congresswoman|leader|representative)\b', '', text, flags=re.IGNORECASE)  # Remove common political titles
    #text = re.sub(r'\b(government|politics|political|policy|election)\b', '', text, flags=re.IGNORECASE)  # Remove common political terms
    text = re.sub(r'\b\d+\b', '', text)  # Remove standalone numbers
    text = re.sub(r'\b(www|http|https)\S+\b', '', text)  # Remove URLs
    text = re.sub(r'\b\w*\d\w*\b', '', text)  # Remove words with digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove any remaining non-word, non-space characters
    text = re.sub('copyright', '', text, flags=re.IGNORECASE)  # Remove 'copyright' (case-insensitive)

    return text


data_clean = pd.DataFrame(data_clean.transcript.apply(round3))
data_clean

Unnamed: 0,transcript
abraham,with malice toward none with charity for all with firmness the right god gives see the right let strive finish the work are bind the na...
ambedkar,ambedkar ambedkar babasaheb was indian political reformer who campaigned for the rights the untouchable caste india played role the in...
boris,boris johnson leading conservative politician and british prime minister who was elected leader the conservative party the summer bid tak...
brandt,willy brandt german statesman and politician willy brandt was german leftwing politician who fled nazi persecution after the second world war...
desmond,desmond mpilo tutu was born klerksdorp transvaal october south africa vocal and committed opponent apartheid south africa tutu was awar...
gandhi,mahatma gandhi was prominent indian political leader who was leading figure the campaign for indian independence employed nonviolent principle...
mandela,nelson mandela was south african political activist who spent over years prison for his opposition the apartheid regime was released ...
margaret,margaret thatcher was britains first female prime minister she was known for her tough uncompromising conservative political views and became du...
roosevelt,franklin delano roosevelt january april often referred his initials fdr was the thirtysecond president the united states served through t...
trump,donald trump the president the for many years was chairman and president the trump organisation which has diverse range business and re...


In [24]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Set max_df, min_df, ngram_range, and max_features parameters
max_df = 0.80     # Exclude terms that appear in more than 80% of the documents
min_df = 2       # Exclude terms that appear in fewer than 2 documents
ngram_range = (1,3)  # Consider unigrams, bigrams, and trigrams
max_features = 500  # Maximum number of features to include in the vocabulary

# Initialize CountVectorizer with stop_words and max_df/min_df parameters
cv = CountVectorizer(stop_words='english', max_df=max_df, min_df=min_df, ngram_range=ngram_range, max_features=max_features)

# Fit and transform the cleaned data
data_cv = cv.fit_transform(data_clean.transcript)

# Convert the CountVectorizer output to a DataFrame
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index

# Display the resulting document-term matrix
data_dtm

Unnamed: 0,able,abraham,abraham lincoln,act,action,active,activist,activists,address,africa,...,worked,working,world famous,world famous people,world war,worth,wrote,year,york,young
abraham,1,11,8,0,1,0,1,1,4,0,...,1,2,0,0,0,0,0,1,1,1
ambedkar,0,0,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,2,0,3,2
boris,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,2,0,0,1
brandt,2,0,0,3,4,0,3,1,0,0,...,2,2,0,0,3,0,2,2,0,2
desmond,1,1,1,1,0,0,0,0,0,9,...,0,0,0,0,0,0,1,0,1,0
gandhi,0,0,0,0,1,0,0,0,0,7,...,1,0,1,1,0,0,0,0,0,1
mandela,1,0,0,1,0,2,1,0,0,17,...,0,0,1,1,0,0,0,1,0,1
margaret,0,0,0,0,0,1,0,0,0,0,...,1,1,1,0,0,1,1,0,0,0
roosevelt,2,1,1,0,0,0,1,1,1,0,...,1,0,1,1,5,0,0,0,2,0
trump,1,1,1,1,0,1,0,1,1,0,...,1,0,0,0,0,4,0,1,2,0
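
The effect of `ngram_range`, `min_df`, and `max_df` can be illustrated by hand on a toy corpus. The sketch below uses only the standard library and mimics the filtering rules as I understand them (an integer `min_df` is a document count, a float `max_df` is a fraction of documents). The four documents and the `ngrams` helper are made up for the example.

```python
from collections import Counter

# Made-up toy corpus of four short documents
docs = [
    "world war leader",
    "world war hero",
    "world famous leader",
    "world peace",
]

def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split(), 1, 2)))  # like ngram_range=(1, 2)

min_df = 2     # integer: minimum number of documents
max_df = 0.80  # float: maximum fraction of documents

kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * len(docs)
)
print(kept)  # ['leader', 'war', 'world war']
```

Here "world" is dropped because it appears in all four documents (more than 80%), and terms such as "hero" are dropped because they appear in only one.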
