## Data for training and testing

This notebook takes the data obtained from [SEPLN](http://tass.sepln.org/tass_data/download.php) and creates two data sets: training and testing.


## Libraries and dependencies

In [1]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et 
import re

### Training

In [3]:
file = "A_RawData/general-train-tagged.xml"
xtree = et.parse(file)
xroot = xtree.getroot()

In [5]:
# Defining the colnames of the dataframe
df_cols = [ "tweet" , "sentiment" ]
# I will collect every observation from the key tweet
rows = []
# I will go over each subsection of the key tweet
for node in xroot:
    
    # Collecting the content of the tweet
    s_tweet = node.find( "content" ).text
    
    # Collecting the sentiment of the tweet.
    # Each "sentiments" element has "polarity" subsections,
    # and each polarity has a value and a type
    s_pv = node.findall("sentiments")
    
    # Polarity values found for this tweet
    value = []
    # For each "sentiments" element, take the value of its first
    # polarity entry and append it to the value list
    for item in s_pv:
        # Getting all polarity entries
        test = item.findall("polarity")
        # Getting the value of the first one
        value.append(test[0].find("value").text)
    # Append to the rows list a dict made up of the tweet message (s_tweet)
    # and the sentiment value
    rows.append({"tweet": s_tweet, "sentiment": value})

# Generating the dataframe
training = pd.DataFrame( rows, columns = df_cols )
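
The extraction loop above can be sketched more compactly with `findtext` and a path expression. This is a minimal sketch, not the notebook's method: the inline sample is hypothetical, its element layout is an assumption inferred from the tags the loop looks up, and `extract_rows` is an illustrative name. Unlike the loop above, it stores the label as a plain string rather than a one-item list.

```python
import xml.etree.ElementTree as et

# Hypothetical two-tweet sample; the layout is an assumption inferred
# from the tags used in the loop above, not copied from the TASS file.
SAMPLE_XML = """
<tweets>
  <tweet>
    <content>Gracias MAR</content>
    <sentiments>
      <polarity><value>P</value></polarity>
    </sentiments>
  </tweet>
  <tweet>
    <content>que día más largo</content>
    <sentiments>
      <polarity><value>NONE</value></polarity>
    </sentiments>
  </tweet>
</tweets>
"""

def extract_rows(root):
    """Return one {'tweet', 'sentiment'} dict per child element of root."""
    rows = []
    for node in root:
        text = node.findtext("content")
        # Text of the first sentiments/polarity/value match, or None if absent
        label = node.findtext("sentiments/polarity/value")
        rows.append({"tweet": text, "sentiment": label})
    return rows

rows = extract_rows(et.fromstring(SAMPLE_XML))
print(rows)
```

Because `findtext` returns `None` when a tag is missing, a malformed tweet yields a missing value instead of raising an `IndexError` on `test[0]`.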

In [6]:
training.head()

Unnamed: 0,tweet,sentiment
0,"Salgo de #VeoTV , que día más largoooooo...",[NONE]
1,@PauladeLasHeras No te libraras de ayudar me/n...,[NEU]
2,@marodriguezb Gracias MAR,[P]
3,"Off pensando en el regalito Sinde, la que se v...",[N+]
4,Conozco a alguien q es adicto al drama! Ja ja ...,[P+]


**Cleaning the data frame**

In [11]:
# Take every cell and check whether it is a list.
# If it is a list, keep its first element; otherwise
# return the cell unchanged
training2 = training.applymap(lambda x: x[0] if isinstance(x, list) else x)
# Change datatype of sentiment column to str
training2.sentiment = training2.sentiment.astype(str)
# Print the counts of unique values of the sentiment list
print(training2["sentiment"].value_counts())

P+      1652
NONE    1483
N       1335
P       1232
N+       847
NEU      670
Name: sentiment, dtype: int64
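
Only the `sentiment` column holds lists, so the unwrapping step could be limited to that column with `Series.map` instead of visiting every cell. This also avoids `DataFrame.applymap`, which has been deprecated since pandas 2.1 in favour of `DataFrame.map`. The sketch below uses a toy frame and a hypothetical helper named `first_item`.

```python
import pandas as pd

# Toy frame shaped like `training`: sentiment cells are one-item lists
toy = pd.DataFrame({
    "tweet": ["Gracias MAR", "que día más largo"],
    "sentiment": [["P"], ["NONE"]],
})

def first_item(cell):
    """Return the first element of a non-empty list, else the cell itself."""
    if isinstance(cell, list) and cell:
        return cell[0]
    return cell

# Map over the sentiment column only, then cast to str as in the cell above
toy["sentiment"] = toy["sentiment"].map(first_item).astype(str)
print(toy["sentiment"].tolist())
```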


Dropping the 'NONE' observations from sentiment

In [15]:
training3 = training2[training2["sentiment"] != "NONE"].copy().reset_index(drop=True)

In [16]:
print(training3["sentiment"].value_counts())

P+     1652
N      1335
P      1232
N+      847
NEU     670
Name: sentiment, dtype: int64


Recoding the sentiment to have three categories: 
 - '0' Negative
 - '1' Neutral
 - '2' Positive

In [30]:
# Changing sentiment variable to values
training3["sentiment"]=training3["sentiment"].replace("P+", 2)
training3["sentiment"]=training3["sentiment"].replace("NEU", 1)
training3["sentiment"]=training3["sentiment"].replace("N", 0)
training3["sentiment"]=training3["sentiment"].replace("N+", 0)
training3["sentiment"]=training3["sentiment"].replace("P", 2)
print(training3.shape)
print(training3["sentiment"].value_counts())
print(training3.head())

(5736, 2)
2    2884
0    2182
1     670
Name: sentiment, dtype: int64
                                               tweet  sentiment
0  @PauladeLasHeras No te libraras de ayudar me/n...          1
1                          @marodriguezb Gracias MAR          2
2  Off pensando en el regalito Sinde, la que se v...          0
3  Conozco a alguien q es adicto al drama! Ja ja ...          2
4  Toca @crackoviadeTV3 . Grabación dl especial N...          2
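
The five `replace` calls above could also be written as one dictionary lookup, which keeps the whole label scheme in a single place. A minimal sketch on a toy series follows; `LABEL_TO_CODE` is an illustrative name, and the mapping simply restates the three categories listed above.

```python
import pandas as pd

# Five-level TASS labels collapsed to the three codes listed above
LABEL_TO_CODE = {"N+": 0, "N": 0, "NEU": 1, "P": 2, "P+": 2}

labels = pd.Series(["NEU", "P", "N+", "P+", "N"], name="sentiment")
codes = labels.map(LABEL_TO_CODE)

# A label missing from the dict would become NaN, so this check
# catches typos or unexpected categories
assert codes.notna().all()
print(codes.tolist())
```

With `map`, an unexpected label surfaces as a missing value, whereas `replace` would silently leave the original string in the column.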


Saving the data frame

In [19]:
training3.to_excel("B_OutputData/trainingTASS_data.xlsx", index = False)

### Testing data

I follow the same steps as above. The testing sample is also from TASS.

In [33]:
# file2="../../Data/Downloaded/TASS_data/politics-test-tagged.xml"
# Parsing data
file2 = "A_RawData/politics-test-tagged.xml"
xtree2 = et.parse(file2)
xroot2 = xtree2.getroot()

# Generating columns for dataframe
df_cols = [ "tweet" , "sentiment" ]
rows = []

# We will go over each subsection of the key tweet
for node in xroot2: 
    
    # Collecting the content of the tweet
    s_tweet = node.find( "content" ).text
    
    # Collecting the sentiment of the tweet.
    # It has a subsection called polarity,
    # which has a value and a type
    s_pv= node.findall( "sentiments" )
    
    # Empty list
    value=[]
    
    # s_pv has two information sources value and type from polarity
    # We will get from polarity the value and append observations 
    # to the value list
    for item in s_pv:
        
        # Getting all information in polarity key
        test=item.findall( "polarity" )
        # Getting values
        value1=test[ 0 ].find( "value" ).text
        value.append( value1 )
    
    # appending values to the list row
    rows.append( { "tweet" : s_tweet , "sentiment" : value } )

# Generating the dataframe
testing = pd.DataFrame( rows, columns = df_cols )

In [35]:

# Take every cell and check whether it is a list.
# If it is a list, keep its first element; otherwise
# return the cell unchanged
testing2 = testing.applymap(lambda x: x[0] if isinstance(x, list) else x)

# Change datatype of sentiment column to str
testing2.sentiment = testing2.sentiment.astype(str)

# Print the counts of unique values of the sentiment list
print(testing2["sentiment"].value_counts())

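# Drop NONE observations in sentiment column.
# Note: reset_index() without drop=True keeps the old index as an extra
# "index" column, which is why the shape printed below has 3 columns
testing3 = testing2[testing2["sentiment"]!="NONE"].copy().reset_index()
print('Dropping those classified as NONE')
# Check NONE is no longer in the dataframe
print(testing3["sentiment"].value_counts())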
print('Recoding categories')
# Changing sentiment variable to values
testing3["sentiment"]=testing3["sentiment"].replace("P+", 2)
testing3["sentiment"]=testing3["sentiment"].replace("NEU", 1)
testing3["sentiment"]=testing3["sentiment"].replace("N", 0)
testing3["sentiment"]=testing3["sentiment"].replace("N+", 0)
testing3["sentiment"]=testing3["sentiment"].replace("P", 2)
print('New data and value counts')
print(testing3.shape)
print(testing3["sentiment"].value_counts())


NEU     941
N       698
P       639
NONE    222
Name: sentiment, dtype: int64
Dropping those classified as NONE
NEU    941
N      698
P      639
Name: sentiment, dtype: int64
Recoding categories
New data and value counts
(2278, 3)
1    941
0    698
2    639
Name: sentiment, dtype: int64
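
The testing frame reports shape `(2278, 3)` while the training frame reported `(5736, 2)`. The extra column comes from calling `reset_index()` without discarding the old index. The toy example below is a sketch of that difference and does not use the TASS data.

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet": ["a", "b", "c"],
    "sentiment": ["N", "NONE", "P"],
})
kept = toy[toy["sentiment"] != "NONE"]

# Old index is preserved as a new column named "index"
with_old_index = kept.reset_index()
# Old index is discarded; only the original columns remain
clean = kept.reset_index(drop=True)

print(with_old_index.columns.tolist(), with_old_index.shape)
print(clean.columns.tolist(), clean.shape)
```

If the exported testing file should have the same two columns as the training file, passing `drop=True` (or dropping the `index` column afterwards, as the training cell does) would align them.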


In [37]:
# Exporting data
testing3.to_excel("B_OutputData/testingTASS_data.xlsx", index = False)