# Week 8: Dataframe with all features by Juliette and Anke

At the end the df contains:

- Count of Adjectives and Adverbs. 
- Count of Entities.
- All TextFeatures. 
- Readability: The Flesch Reading Ease formula.
- Syllables count.
- SpellChecker.
- Punctuation count.
- Uppercase & Lowercase count. 
- Sentiment analysis.




In [143]:
import pandas as pd
import csv
import spacy
import re

    Creating the dataframe with a sample of 2000 tweets. 
    This has been done for showing the steps of creating variables.
    The analysis contains all of the tweets.

In [144]:
df = pd.read_csv('tweets_labeled.csv').sample(2000)

    Replacing binary labels with text:

In [145]:
df['label'].replace(0, 'Fake', inplace=True)
df['label'].replace(1, 'True', inplace=True)

In [146]:
df = df.drop(["tweet_id"], axis = 1)

# Anke: Cleaning tweets
I removed the mentions of retweets “RT @account_name" and the https addresses. 

In [147]:
for i in df.index:
    txt = df.loc[i]["text"]
    txt = re.sub(r"RT\ \@\w*\:\ ", '', txt) #replace RT-tags
    txt= re.sub(r'@[A-Z0-9a-z_:]+','',txt) #replace username-tags
    txt = re.sub('https?://[A-Za-z0-9./]+','',txt) #replace URLs
    df.at[i,"text"]=txt

# Anke: Adjectives and Adverbs
Using Python’s natural language processing (NLP) library spaCy I calculated the adjectives and adverbs; two grammatical concepts that are there to describe and modify nouns and verbs, respectively.
https://spacy.io/usage/spacy-101 

Part-of-speech (POS) Tagging	Assigning word types to tokens, like verb or noun.

Loading the model via spacy.load(). This will return a Language object containing all components and data needed to process text.

In [148]:
nlp = spacy.load("en_core_web_sm")

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name. 

- .pos = int = Coarse-grained part-of-speech from the Universal POS tag set.
- .pos_ = unicode = Coarse-grained part-of-speech from the Universal POS tag set.

Spacy is highly optimised and does the multiprocessing for you. So it is better to take the data out of a df.

In [149]:
#to use the spacy pipeline I created a list 
pos = []

for doc in nlp.pipe(df['text'].astype('unicode').values):
    if doc.is_parsed:
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the same number of entries of the original Dataframe.
        # Add blanks in case the parse fails
        pos.append(None)

df = df.assign(PartofSpeach = pos)

#### Alphabetical listing Universal POS tags
These tags mark the core part-of-speech categories. To distinguish additional lexical and grammatical properties of words, use the universal features.
source:https://universaldependencies.org/docs/u/pos/

- <b> ADJ: adjective </b>
- ADP: adposition
- <b> ADV: adverb </b>
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other

In [150]:
from collections import Counter

# all Universal POS tags can be entered here
grammarList = ['ADJ', 'ADV', 'ADP', 'AUX', 'CONJ', "DET", "INTJ", 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']

for index, row in df.iterrows():
    c = Counter(([token.pos_ for token in nlp(row['text'])]))
    for el, cnt in c.items():
        if el in grammarList:
            df.loc[index, el] = cnt
    


# Anke: Entities

from: https://texthero.org/docs/api/texthero.nlp.named_entities
*named_entities(s, package='spacy')*

Return named-entities.
Return a Pandas Series where each rows contains a list of tuples containing information regarding the given named entities.

Tuple: *(entity’name, entity’label, starting character, ending character)*

Under the hood, named_entities make use of Spacy name entity recognition.

List of labels:
- PERSON: People, including fictional.
- NORP: Nationalities or religious or political groups.
- FAC: Buildings, airports, highways, bridges, etc.
- ORG : Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including ”%“.
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.


In [151]:
import texthero as hero
import math

In [152]:
entities = hero.named_entities(df['text'])
df['entities'] = entities

pd.set_option('display.max_columns', None)  


In [153]:
entities_list = []

# loop through array
for index, row in df.iterrows():
    # loop through object
    for item in row['entities']:
        # item [1] is second item in object so the entitie
        if item[1] not in entities_list:
            # add to entities_list
            entities_list.append(item[1])
            # create column and set first appearance of entitie to 1
            df.loc[index, item[1]] = 1
        # change NaN to 1  
        # math checks if value is NaN
        elif math.isnan(df.loc[index, item[1]]):
            df.loc[index, item[1]] = 1
        # add to number
        elif item[1] in entities_list and not math.isnan(df.loc[index, item[1]]):
            df.loc[index, item[1]] += 1



# Everyone: Textfeatures

Textfeatures is a python package which extracts the basic features from the text data such as hashtags, stopwords, numerics. 

In this file we will inspect the dataset from an AI research project using: 

1. word_count():- give the total words count present in text data.

2. char_count():- give the characters count.

3. avg_word_length():- give the average word length.

4. stopwords_count():- give the stopwords count.

5. stopwords():- extract the stopwords from the text data.

6. hashtags_count():- give the hashtags count.

7. hashtags():- extract the hashtags from text data.

8. numeric_count():- give the numeric digits count.

9. user_mentions_count():- give the user mentions count from text data.

10. user_mentions():- extract the user mentions from text data.

11. clean():- give the pre-processed data after removal for unnecessary material in text data.

In [154]:
import textfeatures as tf

tf.word_count(df,"text","word_count")
tf.char_count(df,"text","char_count")
tf.avg_word_length(df,"text","avg_word_length")
tf.stopwords_count(df,"text","stopwords_count")
tf.stopwords(df,"text","stopwords")
tf.hashtags_count(df,"text","hashtags_count")
tf.hashtags(df,"text","hashtags")
tf.numerics_count(df,"text","num_count")
tf.user_mentions_count(df,"text","user_mentions_count")
tf.user_mentions(df,"text","user_mentions")
tf.clean(df,"text","clean_text")


Unnamed: 0,text,label,PartofSpeach,PUNCT,PROPN,VERB,AUX,DET,NOUN,ADP,PRON,ADJ,PART,NUM,SCONJ,ADV,X,INTJ,SYM,entities,CARDINAL,ORG,GPE,PERSON,MONEY,WORK_OF_ART,NORP,EVENT,TIME,PERCENT,FAC,DATE,ORDINAL,LAW,LOC,PRODUCT,QUANTITY,LANGUAGE,word_count,char_count,avg_word_length,stopwords_count,stopwords,hashtags_count,hashtags,num_count,user_mentions_count,user_mentions,clean_text
155520,'#MoscowMitch should be the focus of everyone ...,True,"[PUNCT, PROPN, PROPN, VERB, AUX, DET, NOUN, AD...",5.0,5.0,3.0,1.0,3.0,2.0,3.0,2.0,1.0,,,,,,,,"[(#, CARDINAL, 1, 2), (MoscowMitch, ORG, 2, 13...",1.0,2.0,1.0,,,,,,,,,,,,,,,,20,121,5.368421,9,"[should, be, the, of, who, about, the, of, the]",0,[],0,0,[],moscowmitch focus everyone cares america forge...
210971,'Not sure my grandparents would’ve been allowe...,True,"[PUNCT, PART, ADJ, DET, NOUN, VERB, VERB, AUX,...",4.0,,6.0,1.0,2.0,4.0,1.0,1.0,1.0,2.0,,,,,,,[],,,,,,,,,,,,,,,,,,,19,120,5.368421,6,"[my, been, to, by, this, and]",0,[],0,0,[],sure grandparents wouldve allowed enter measur...
186412,'Excellent piece: 👇🏻“CIA Prepares the Blame th...,Fake,"[PUNCT, ADJ, NOUN, PUNCT, NUM, NOUN, VERB, DET...",5.0,4.0,2.0,,4.0,4.0,2.0,,1.0,,1.0,,,,,,"[(👇, CARDINAL, 18, 19)]",1.0,,,,,,,,,,,,,,,,,,18,116,5.500000,5,"[the, the, for, the, the]",0,[],0,0,[],excellent piece prepares blame consultant excu...
206715,'Illegal Immigration Expected to Hit Highest L...,Fake,"[PUNCT, ADJ, NOUN, VERB, PART, VERB, ADJ, PROP...",2.0,4.0,2.0,,,1.0,,,2.0,1.0,,1.0,,,,,"[(George W. Bush', PERSON, 57, 72)]",,,,1.0,,,,,,,,,,,,,,,11,72,5.636364,1,[to],0,[],0,0,[],illegal immigration expected highest level sin...
207516,'Previously Deported Sex Offenders Hiding in M...,Fake,"[PUNCT, ADV, VERB, NOUN, NOUN, PROPN, ADP, PRO...",3.0,4.0,2.0,,,2.0,1.0,,,,,,1.0,,,,[],,,,,,,,,,,,,,,,,,,12,72,5.545455,1,[in],0,[],0,0,[],previously deported offenders hiding migrant g...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106049,'Donald Trump Jr.: Covington Catholic Hoax Sho...,Fake,"[PUNCT, PROPN, PROPN, PROPN, PUNCT, PROPN, PRO...",4.0,11.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,,,,"[('Donald Trump Jr., PERSON, 0, 17), (Covingto...",,,1.0,2.0,,,,,,,,,,,,,,,19,105,4.578947,2,"[the, of]",0,[],0,0,[],donald trump covington catholic hoax shows rea...
29130,"'Ugh, too real '",Fake,"[PUNCT, NOUN, PUNCT, ADV, ADJ, PUNCT]",3.0,,,,,1.0,,,1.0,,,,1.0,,,,[],,,,,,,,,,,,,,,,,,,4,16,3.250000,1,[too],0,[],0,0,[],real
8357,'A newly unearthed letter from 2016 shows that...,True,"[PUNCT, DET, ADV, VERB, NOUN, ADP, NUM, VERB, ...",3.0,1.0,3.0,,1.0,5.0,4.0,,1.0,2.0,1.0,1.0,1.0,,,,"[(2016, DATE, 31, 35), (Republican, NORP, 47, ...",,,1.0,,,,1.0,,,,,1.0,,,,,,,18,124,5.944444,4,"[from, that, for, to]",0,[],1,0,[],newly unearthed letter shows republican senato...
137882,'Feds Resettle 224K Border Crossers Across U.S...,Fake,"[PUNCT, PROPN, NOUN, NUM, PROPN, PROPN, PROPN,...",2.0,7.0,,,1.0,2.0,,,,,1.0,,,,,,[],,,,,,,,,,,,,,,,,,,10,66,5.700000,0,[],0,[],0,0,[],feds resettle border crossers across every hal...


# Juliette: The Flesch Reading Ease formula

In [155]:
import textstat
import pandas as pd
from spellchecker import SpellChecker
import re

    Textstat has been used, which is a library to calculate statistics from text. 
    It helps determine readability, complexity, and grade level.
    
    The following formula returns the Flesch Reading Ease Score (FRES). 
    It scores on the readability of the document.
    
    While the maximum score is 121.22, there is no limit on how low the score can be. 
    A negative score is valid.
    
The documentation can be found on https://pypi.org/project/textstat/.

    Score Difficulty
    90-100 Very Easy
    80-89 Easy
    70-79 Fairly Easy
    60-69 Standard
    50-59 Fairly Difficult
    30-49 Difficult
    0-29 Very Confusing

Flesch, Rudolf. "How to Write Plain English". University of Canterbury. Archived from the original on July 12, 2016. Retrieved July 12, 2016. https://web.archive.org/web/20160712094308/http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml

In [156]:
df['FRES']=df['text'].apply(textstat.flesch_reading_ease).astype(int)

## Juliette: Syllables count

    With textstat, I have also counted syllables in the text. 
    The following formula returns the number of syllables present in the given text.

In [157]:
df['Syllables']=df['text'].apply(textstat.syllable_count)

## Juliette: Sentences count

    Textstat also has a possibility to count the sentences in the text. 

In [158]:
df['Sentences']=df['text'].apply(textstat.sentence_count)

# Juliette: SpellChecker

    With pyspellchecker, I performed Pure Python Spell Checking based on Peter Norvig’s blog post.
    It uses a Levenshtein Distance algorithm to find permutations,
    within an edit distance of 2 from the original word. 
    It then compares all permutations (insertions, deletions, replacements, and transpositions) 
    to known words in a word frequency list. 
    Those words that are found more often in the frequency list are more likely the correct results.
    
Documentation: https://pypi.org/project/pyspellchecker/

In [159]:
spell = SpellChecker()
df['Spelling_Mistakes']=df['text'].apply(spell.unknown)

# Juliette: punctuation count

    For the punctuation count, I used a code that I found on stack overflow: 
https://stackoverflow.com/questions/58252056/count-punctuation-in-a-dataframe-column

    First, I count the number of punctuations:

In [160]:
import string

In [161]:
count = lambda l1,l2: sum([1 for x in l1 if x in l2])

df['count_punct'] = df.text.apply(lambda s: count(s, string.punctuation))


    Then I create a list where all the punctuations of a tweet are accumulated:

In [162]:
accumulate = lambda l1,l2: [x for x in l1 if x in l2]

df['acc_punct_list'] = df.text.apply(lambda s: accumulate(s, string.punctuation))

    The next step is to create a dictionary that counts the occurences of each punctuation:

In [163]:
df['acc_punct_dict'] = df.text.apply(lambda s: {k:v for k, v in Counter(s).items() if k in string.punctuation})

    Now the punctuations will each be given a column of their own:

In [164]:
df_punct = df.acc_punct_dict.apply(pd.Series)

    And at last, the column will be merged with the entire dataframe:

In [165]:
df = pd.concat([df, df_punct], axis=1)

# Juliette: Uppercase & Lowercase

    To count the uppercases and lowercases in the tweets, I found a piece of code in stack overflow.
https://stackoverflow.com/questions/49230262/how-to-count-uppercase-and-lowercase-on-pandas-dataframe

    It finds all the upper- or lowercases, puts them in a list, and then counts the length of the list.

In [166]:
df['Uppercase'] = df['text'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['text'].str.findall(r'[a-z]').str.len()

    The dataframe of my added features will look like this:

In [167]:
df.iloc[:, 50:82]

Unnamed: 0,Syllables,Sentences,Spelling_Mistakes,count_punct,acc_punct_list,acc_punct_dict,',#,.,",",:,-,!,$,|,?,/,%,"""",(,&,=,@,),\,[,],*,+,Uppercase,Lowercase
155520,29,1,"{…, }",5,"[', #, ., ,, ']","{''': 2, '#': 1, '.': 1, ',': 1}",2.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,6,90
210971,28,2,"{’, , …}",3,"[', ., ']","{''': 2, '.': 1}",2.0,,1.0,,,,,,,,,,,,,,,,,,,,,2,95
186412,26,1,"{…, , “, 🏻, 👇, ”}",3,"[', :, ']","{''': 2, ':': 1}",2.0,,,,1.0,,,,,,,,,,,,,,,,,,,13,78
206715,20,1,{ },3,"[', ., ']","{''': 2, '.': 1}",2.0,,1.0,,,,,,,,,,,,,,,,,,,,,10,49
207516,21,1,{ },3,"[', ,, ']","{''': 2, ',': 1}",2.0,,,1.0,,,,,,,,,,,,,,,,,,,,9,49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106049,27,1,"{…, }",5,"[', ., :, |, ']","{''': 2, '.': 1, ':': 1, '|': 1}",2.0,,1.0,,1.0,,,,1.0,,,,,,,,,,,,,,,15,66
29130,5,1,{ },3,"[', ,, ']","{''': 2, ',': 1}",2.0,,,1.0,,,,,,,,,,,,,,,,,,,,1,9
8357,32,1,"{…, }",4,"[', ', ', ']",{''': 4},4.0,,,,,,,,,,,,,,,,,,,,,,,3,95
137882,14,2,{ },4,"[', ., ., ']","{''': 2, '.': 2}",2.0,,2.0,,,,,,,,,,,,,,,,,,,,,11,39


# Nadia: Sentiment Analysis

In [168]:
import sklearn as sk
import pandas as pd
import textfeatures as tf
import texthero as hero
from textblob import TextBlob
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [169]:
analyzer = SentimentIntensityAnalyzer()
df['scores'] = df['text'].apply(lambda text: analyzer.polarity_scores(text))

In [170]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

In [171]:
df['pos']  = df['scores'].apply(lambda score_dict: score_dict['pos'])
df['neg']  = df['scores'].apply(lambda score_dict: score_dict['neg'])
df['neu']  = df['scores'].apply(lambda score_dict: score_dict['neu'])

In [172]:
df.to_csv('TextfeaturesTweets2.csv')