In [1]:
#import all necessary libraries
import json
import pandas as pd
from pandas import json_normalize

In [2]:
#open json file
with open('normal_tweets.json', 'r') as json_file:
    data = json.load(json_file)

In [3]:
#normalize the json file
df = json_normalize(data)

**Notice that the entire dataset is stored in a single row and a single column. The code below addresses that.**
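To illustrate why everything lands in one cell, here is a minimal sketch on a made-up payload. It assumes the file has a single top-level `data` key holding a list of tweet dictionaries, which is what the later `explode('data')` call suggests; `toy_payload` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical payload: one top-level "data" key holding a list of tweet
# dictionaries. The ids and texts are invented for illustration.
toy_payload = {
    "data": [
        {"id": "1", "text": "hola mundo"},
        {"id": "2", "text": "buenos días"},
    ]
}

toy_df = pd.json_normalize(toy_payload)

# json_normalize flattens nested dictionaries but leaves lists untouched,
# so the whole list of tweets ends up in a single cell.
print(toy_df.shape)
print(type(toy_df.loc[0, "data"]))
```

The list has to be split into rows before the individual tweet fields can become columns.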

In [4]:
#separate by rows
df = df.explode('data', ignore_index=True)

#expand columns to see full text
pd.set_option('display.max_colwidth', None)

**The data is now separated into individual rows; however, each row still stores its tweet as a single dictionary in one column. The code below addresses that.**

In [5]:
#normalize the data again
df_data_normalize = json_normalize(df['data'])

**The code results in a DataFrame with 4 columns:**
- edit_history_tweet_ids
- id
- created_at
- text
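As a possible shortcut, `json_normalize` accepts a `record_path` argument that pulls the list of records out directly, which may replace the normalize, explode, normalize sequence with one call. The sketch below assumes the same top-level `data` key; the payload values are invented for illustration.

```python
import pandas as pd

# Hypothetical payload with the four fields listed above (values are made up).
toy_payload = {
    "data": [
        {"edit_history_tweet_ids": ["1"], "id": "1",
         "created_at": "2023-01-01T00:00:00.000Z", "text": "hola mundo"},
        {"edit_history_tweet_ids": ["2"], "id": "2",
         "created_at": "2023-01-02T00:00:00.000Z", "text": "buenos días"},
    ]
}

# record_path tells json_normalize which key holds the list of records,
# so each tweet becomes its own row with one column per field.
flat_df = pd.json_normalize(toy_payload, record_path="data")

print(flat_df.columns.tolist())
print(len(flat_df))
```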

In [6]:
#drop unnecessary columns for our analysis
df = df_data_normalize.drop(columns=['edit_history_tweet_ids','id','created_at'])

**The code produces a new DataFrame with each tweet in its own row. We have a total of 24 rows and 1 column, named "text".**

## From this point onward I will be using basic Natural Language Processing to clean up the text

In [7]:
#import all necessary libraries
import nltk
from langdetect import detect

In [8]:
#some tweets have links. We only want the actual tweet text, so we
#keep only the part before the first 'https' (any text after a link is dropped too)
df['text'] = [x.split('https')[0] for x in df['text']]

**Although we have removed the "https" links, the text still contains extra spaces and '\n' characters, which mark line breaks in a tweet.**
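Splitting on `'https'` also discards any words that follow a link. If that text matters, a regular expression is one alternative. This is a minimal sketch, not the approach used in this notebook; `URL_PATTERN`, `strip_urls`, and the sample tweet are hypothetical.

```python
import re

# Matches http:// or https:// followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove every URL from text and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

# Made-up example tweet with a link in the middle.
sample = "mira esto https://t.co/abc123 qué bueno"
print(strip_urls(sample))
```

On a DataFrame this could be applied with `df['text'].apply(strip_urls)`.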

In [9]:
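#remove all '\n' characters and collapse extra spaces in the text
def remove_newline(cell_value):
    #split() breaks on any whitespace (including '\n'); join() rebuilds with single spaces
    return " ".join(cell_value.split())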

#apply the remove_newline function to each cell in the specified column
df['text'] = df['text'].apply(remove_newline)

#convert all text to lower case so the text is standardized
df['text'] = df['text'].str.lower()

**The text is now all lowercase, but our data must contain only Spanish tweets. We must remove any tweet that is not detected as Spanish.**

In [10]:
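#function to detect text language
from langdetect.lang_detect_exception import LangDetectException

def detect_my(text):
    try:
        return detect(text)
    #langdetect raises this when the text has no usable features (e.g. it is empty)
    except LangDetectException:
        return 'unknown'
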
#we are creating a new column that will have values classifying the language each tweet is in 
df['language'] = df['text'].apply(detect_my)

In [11]:
#now we want to keep only the spanish tweets
#this will effectively remove any tweet not in Spanish
#selecting ['text'] with a list keeps the result as a dataframe
new_df = df.loc[df['language'] == 'es', ['text']].copy()

**Our data now contains only Spanish tweets. Still, it needs further cleaning before it is ready for sentiment analysis.**

In [12]:
#import proper libraries
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [13]:
#before removing filler words in our tweets, we need to tokenize them
#(the NLTK tokenizer and stopwords data must be downloaded first with nltk.download)
#tokenize only the spanish tweets in new_df, using the spanish tokenizer model
new_df['text'] = new_df['text'].apply(lambda text: word_tokenize(text, language='spanish'))

#set up stopwords to spanish
stop_words = set(stopwords.words('spanish'))

#apply to data; assigning back to the column keeps new_df a dataframe
new_df['text'] = new_df['text'].apply(lambda x: [item for item in x if item not in stop_words])

**The final data shows each tweet in its own row, fully tokenized, and with no filler words.**
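One caveat: `word_tokenize` returns punctuation marks as separate tokens, and these are not part of the Spanish stop word list. If they should be removed as well, the filter could be extended along these lines. This is a sketch under assumptions: `toy_stop_words` is a small hand-picked subset rather than the NLTK corpus, and `keep_content_words` is a hypothetical helper.

```python
# Hypothetical, hand-picked subset of Spanish stop words so the example
# runs without downloading the NLTK corpus.
toy_stop_words = {"el", "la", "de", "que", "y", "es", "muy"}

def keep_content_words(tokens, stop_words):
    """Drop stop words and tokens that contain no alphabetic characters."""
    return [
        tok for tok in tokens
        if tok not in stop_words and any(ch.isalpha() for ch in tok)
    ]

# Made-up token list, shaped like the output of a tokenizer.
toy_tokens = ["el", "partido", "de", "hoy", "es", "muy", "bueno", "!", "..."]
print(keep_content_words(toy_tokens, toy_stop_words))
```

With the real data, the same helper could be applied per row using the `stop_words` set built from `stopwords.words('spanish')`.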

**For privacy reasons and to stay compliant with Twitter's policy, I do not think I can show the final output. Furthermore, at this point I did not have enough time to finish the overall analysis, as I had to hand in my capstone. I was also not certain that my approach to sentiment analysis was right. However, given the research papers I have read on this topic, such an analysis is possible.**