### This notebook preprocesses text from the Twitter airline sentiment [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment)<br> in preparation for generating embeddings with a sentence transformer model.


In [41]:
import sys
sys.path.append("../")
import pandas as pd
import numpy as np
import preprocessing.preprocessing as pp

Reading in the raw data:

In [42]:
df = pd.read_csv("../data/raw/twitter_airline_sentiment.csv")
# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

In [43]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [44]:
df.tail()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
14635,569587686496825344,positive,0.3487,,0.0,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0,Customer Service Issue,1.0,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)
14639,569587140490866689,neutral,0.6771,,0.0,American,,daviddtwu,,0,@AmericanAir we have 8 ppl so we need 2 know h...,,2015-02-22 11:58:51 -0800,"dallas, TX",


Print number of tweets and number of unique tweets:

In [45]:
print("Number of tweets: ", len(df["text"])) 
print("Unique tweets: ", len(set(df["text"])))

Number of tweets:  14640
Unique tweets:  14427


In [46]:
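# Record each row's position in the raw data so deduplicated rows can be traced back
df["original_index"] = df.index
# Keep only the last occurrence of each duplicated tweet text
df_nd = df.drop_duplicates(subset=["text"], keep="last", ignore_index=True).copy()
print(len(df_nd))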


14427
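To make the effect of `keep='last'` concrete, here is a minimal sketch on toy data. The three rows and their labels are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): the first and third rows share the same text
# but carry different sentiment labels.
toy = pd.DataFrame({
    "text": ["late again", "great crew", "late again"],
    "airline_sentiment": ["negative", "positive", "neutral"],
})
toy["original_index"] = toy.index

# Every row whose text occurs more than once, for inspection before dropping.
dupes = toy[toy.duplicated(subset=["text"], keep=False)]

# keep="last" retains the final occurrence of each duplicated text.
toy_nd = toy.drop_duplicates(subset=["text"], keep="last", ignore_index=True)

print(dupes)
print(toy_nd)
```

In this toy case the surviving copy of the duplicated text is the one labelled `neutral`, so the choice of `keep` also decides which label survives. Whether duplicated tweets in the real data carry conflicting labels is not something this notebook establishes; inspecting `dupes` on `df` would be one way to check.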


What do the retweets look like? <br>
Should they be kept?

In [47]:
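# Crude check: matches any tweet containing the substring "RT",
# so it also catches non-retweets such as "$100 bucks RT"
retweets = [tweet for tweet in df_nd["text"] if "RT" in tweet]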
print("Number of retweets in data set: ", len(retweets))
retweets[:10]

Number of retweets in data set:  117


['Nice RT @VirginAmerica: Vibe with the moodlight from takeoff to touchdown. #MoodlitMonday #ScienceBehindTheExperience http://t.co/Y7O0uNxTQP',
 "@VirginAmerica You'd think paying an extra $100 bucks RT for luggage might afford you hiring an extra hand at @sfo #lame",
 "Always have it together!!! You're welcome! RT @VirginAmerica: @jessicajaymes You're so welcome.",
 '😎 RT @VirginAmerica: You’ve met your match. Got status on another airline? Upgrade (+restr): http://t.co/RHKaMx9VF5. http://t.co/PYalebgkJt',
 'Awesome! RT @VirginAmerica: Watch nominated films at 35,000 feet. #MeetTheFleet #Oscars http://t.co/DnStITRzWy',
 "@VirginAmerica If you'd love to see more girls be inspired to become pilots, RT our free WOAW event March 2-8 at ABQ. http://t.co/rfXlV1kGDh",
 'Nice RT @VirginAmerica: The man of steel might be faster, but we have WiFi – just saying. #ScienceBehindTheExperience http://t.co/FGRbpAZSiX',
 '@united Pls Help Baby Hannah get the life saving surgeries she requires.She nee
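The substring test counts any tweet that contains the letters "RT", which includes the "$100 bucks RT" tweet in the output. A stricter check is sketched below. It assumes that a manual retweet is written as the standalone token `RT` followed by an @mention; that convention, the pattern, and the helper name `looks_like_retweet` are assumptions for illustration, not part of the notebook.

```python
import re

# Assumed convention: a manual retweet contains the standalone token "RT"
# followed by whitespace and an @mention.
RETWEET_PATTERN = re.compile(r"\bRT\s+@\w+")


def looks_like_retweet(tweet: str) -> bool:
    """Return True if the tweet contains 'RT @user' as separate tokens."""
    return RETWEET_PATTERN.search(tweet) is not None


samples = [
    # The first two come from the notebook output (first one shortened).
    "Nice RT @VirginAmerica: Vibe with the moodlight from takeoff to touchdown.",
    "@VirginAmerica You'd think paying an extra $100 bucks RT for luggage "
    "might afford you hiring an extra hand at @sfo #lame",
    # Hypothetical tweet: "RT" appears only inside another word.
    "@united stuck at the AIRPORT again",
]
flags = [looks_like_retweet(s) for s in samples]
print(flags)
```

Applied to `df_nd["text"]`, a filter like this would likely give a count lower than 117, since it skips tweets where "RT" means something else or only appears inside a longer word.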

Print a random sample of tweets:

In [48]:
for tweet in df_nd["text"].sample(15, random_state=11):
    print(tweet)

@united great to hear Thankyou so much. Greatly appreciate your replies. Feel much more settled now.
@united Tell me that you're at least going to cover a room and get me out of here.
@JetBlue I'm over that honestly just would like to get going on the journey.
@USAirways would like to see you do similar in PHL! http://t.co/n9vGe2nPIB
@SouthwestAir if you are giving tix to #DestinationDragons show would appreciate one or two for LA😄Flying from PHL to LAX on Friday
@AmericanAir still waiting on a dm response..... #sloooowresponses
@JetBlue is the trueblue site broken at the moment?
@AmericanAir That's good, I'd expect that but I can't get through on the phone to make any changes. Can I change it online?
@SouthwestAir worst air line ever, you have no compassion of the handicapped
@JetBlue I just wanted to say flight attendant fitz was the best tonight on flight #1326 bwi/Bos. Great guy and made the flight fantastic!
@united no- we are boarding- but why can't your agents, on the phone, tak

### Process tweets through selected preprocessing steps:

In [49]:
text = list(df_nd.text)


# List of text preprocessing functions with specified attributes to be applied. They are applied in the order they are listed.
preprocessing_steps = [
    {
        "name": "remove_emoji",  # Replaces emoji with descriptive words
        "attributes": {
            "replace": True
        },
    },
    {
        "name": "remove_urls"
    },
    {
        "name": "remove_html"
    },
    {
        "name": "remove_symbols",  # Removes all @user and #hashtags
        "attributes": {
            "symbols": ["@", "#"],
            "remove_keyword": [True, True]
        },
    },
    {
        "name": "replace_curly_quotes"
    },
    {
        "name": "remove_whitespace_currency"
    },
    {
        "name": "fix_whitespace"
    },
]

clean_text = pp.clean_text(text, preprocessing_steps)

df_nd["clean_text"] = clean_text

Calling remove_emoji with attributes {'replace': True}


Calling remove_urls
Calling remove_html


  soup = BeautifulSoup(t, "html.parser")


Calling remove_symbols with attributes {'symbols': ['@', '#'], 'remove_keyword': [True, True]}
Calling replace_curly_quotes
Calling remove_whitespace_currency
Calling fix_whitespace
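The `preprocessing` module itself is not shown in this notebook. Judging only from the log lines above, `pp.clean_text` seems to look up each step by name and apply the steps in list order, passing any `attributes` as keyword arguments. A minimal sketch of that kind of dispatcher follows. The registry, the `apply_steps` function, and all three step bodies are hypothetical stand-ins, and the real implementations may behave differently.

```python
import re


# Hypothetical step implementations; the real ones live in
# preprocessing/preprocessing.py and may differ.
def remove_urls(texts):
    return [re.sub(r"https?://\S+", "", t) for t in texts]


def fix_whitespace(texts):
    return [re.sub(r"\s+", " ", t).strip() for t in texts]


def remove_symbols(texts, symbols, remove_keyword):
    # For each symbol, drop the whole token (e.g. "@user") when the matching
    # flag is True, otherwise drop only the symbol character.
    for symbol, drop_word in zip(symbols, remove_keyword):
        pattern = re.escape(symbol) + (r"\w+" if drop_word else "")
        texts = [re.sub(pattern, "", t) for t in texts]
    return texts


STEP_REGISTRY = {
    "remove_urls": remove_urls,
    "remove_symbols": remove_symbols,
    "fix_whitespace": fix_whitespace,
}


def apply_steps(texts, steps):
    """Apply each configured step in list order, passing optional attributes."""
    for step in steps:
        func = STEP_REGISTRY[step["name"]]
        texts = func(texts, **step.get("attributes", {}))
    return texts


demo_steps = [
    {"name": "remove_urls"},
    {"name": "remove_symbols",
     "attributes": {"symbols": ["@", "#"], "remove_keyword": [True, True]}},
    {"name": "fix_whitespace"},
]
demo = apply_steps(["@JetBlue  love the #wifi http://t.co/abc"], demo_steps)
print(demo)
```

Order matters in a design like this: running `fix_whitespace` last collapses the gaps left behind by the earlier removals.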


In [50]:
# Shows a random subset of tweets before and after cleaning

# replace=False avoids showing the same tweet twice
ind = np.random.choice(len(df_nd), 10, replace=False)

for t, c in zip(df_nd.text.to_numpy()[ind], df_nd.clean_text.to_numpy()[ind]):
    print(f"Original: {t}")
    print(f"Cleaned:  {c}")
    print()

Original: @USAirways @_RobPrice  how can he try again it will be 5-6 hours before he gets help.  Contingency plans non existent.
Cleaned:  how can he try again it will be 5-6 hours before he gets help. Contingency plans non existent.

Original: @JetBlue Well, thankfully they've got a nice food court here...When will an update be posted?
Cleaned:  Well, thankfully they've got a nice food court here.. .When will an update be posted?

Original: @SouthwestAir thanks so much just had to make a Cancelled Flightlation! I've sent u the info.
Cleaned:  thanks so much just had to make a Cancelled Flightlation! I've sent u the info.

Original: @AmericanAir delayed on the way to Puerto Rico and delayed on the way back to New York, this is disgraceful
Cleaned:  delayed on the way to Puerto Rico and delayed on the way back to New York, this is disgraceful

Original: @eatgregeat WOW~Thx for thinking of us, Greg! Heard #SOBEWFF was amazing! We've heard the same about @JetBlue (ps thx for the info) #Te

In [51]:
# Build a filename suffix from the last word of each step name,
# e.g. "remove_urls" -> "_urls"; for remove_symbols, also append the symbols removed.
save_str = ""
for step in preprocessing_steps:
    save_str += "_" + step["name"].split("_")[-1]
    if step["name"] == "remove_symbols":
        save_str += "".join(step["attributes"]["symbols"])
save_str

'_emoji_urls_html_symbols@#_quotes_currency_whitespace'

In [52]:
df_nd.to_csv(f"../data/processed/twitter_airline_sentiment_cleaned{save_str}.csv", index=False)