Hi, everyone! This notebook contains a sentiment analysis of airline tweets.

In [None]:
library(dplyr)
library(tidytext)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)
library(tm)
options(warn=-1)

dataset <- read.csv('../input/Tweets.csv')

First, let's check the structure of the data.

In [None]:
str(dataset)

After loading the dataset, we split each tweet's text into individual words so that we can find the most frequent words for each sentiment. Before that, the text column is converted from factor to character.

In [None]:
dataset$text <- as.character(dataset$text)
tidy_dataset <- dataset %>%
  unnest_tokens(word, text)
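
To make the tokenization step concrete, here is a minimal sketch on two made-up tweets (the `toy` data frame is hypothetical and not part of Tweets.csv). With its default settings, `unnest_tokens` lowercases the text, strips punctuation, drops the original `text` column, and returns one row per word.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets, not taken from Tweets.csv
toy <- data.frame(
  tweet_id = c(1, 2),
  text = c("Thanks for the GREAT flight!", "Flight delayed again. Not happy."),
  stringsAsFactors = FALSE
)

# One row per word; the other columns are repeated for every token
toy_tokens <- toy %>%
  unnest_tokens(word, text)

print(toy_tokens)
```

Because `tweet_id` is repeated on every row, the tokens can still be traced back to the tweet they came from.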

*- Counts of each sentiment*

In [None]:
# table() returns per-class counts whether the column is a factor or a character vector
table(dataset$airline_sentiment)

As the counts show, negative tweets outnumber the others. That means people tend to tweet more about negative experiences.
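
Raw counts are easier to compare once they are expressed as shares of the total. A minimal sketch on made-up labels (the `sentiment` vector is hypothetical); the same two calls should apply to `dataset$airline_sentiment`.

```r
# Made-up labels, only to illustrate the calls
sentiment <- c("negative", "negative", "neutral", "positive", "negative")

counts <- table(sentiment)     # absolute counts per label
shares <- prop.table(counts)   # the same counts as fractions of the total

print(round(100 * shares, 1))  # percentages per label
```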

**• Visualization of whether the sentiment of the tweets was positive, neutral, or negative for each airline**

In [None]:
ggplot(dataset, aes(x = airline_sentiment, fill = airline_sentiment)) +
  geom_bar() +
  facet_grid(. ~ airline) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6),
       plot.margin = unit(c(3,0,3,0), "cm"))

United, US Airways, and American receive predominantly negative reactions.

**• The Most Frequent Words in Positive Sentiment**

In [None]:
positive <- tidy_dataset %>% 
  filter(airline_sentiment == "positive") 

The 21 most frequent words include too many stop words (prepositions, articles, pronouns, and URL fragments). It is better to remove them.

In [None]:
list <- c("to", "the","i", "a", "you", "for", "on", "and", "is", "are", "am", 
          "my", "in", "it", "me", "of", "was", "your", "so","with", "at", "just", "this",
          "http", "t.co", "have", "that", "be", "from", "will", "we", "an", "can")

positive <- positive %>%
  filter(!(word %in% list))
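
As an alternative to a hand-written list, tidytext ships a `stop_words` data frame that can be removed with `anti_join`. The sketch below runs on a made-up `toy_words` data frame, and restricting the list to the snowball lexicon is an assumption made here to keep the filter conservative. Twitter-specific tokens such as "http" and "t.co" are not English stop words, so a custom list like the one above would still be needed for them.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, not taken from Tweets.csv
toy_words <- data.frame(
  word = c("the", "flight", "was", "great", "thanks"),
  stringsAsFactors = FALSE
)

# Keep only the snowball lexicon (an assumption; stop_words holds other lexicons too)
snowball <- stop_words %>%
  filter(lexicon == "snowball")

# anti_join drops every row whose word appears in the stop word list
cleaned <- toy_words %>%
  anti_join(snowball, by = "word")

print(cleaned)
```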

In [None]:
# Select the word column by name rather than by position
wordcloud(positive$word,
          max.words = 100,
          random.order=FALSE, 
          rot.per=0.30, 
          use.r.layout=FALSE, 
          # "Blues" provides at most 9 colours
          colors=brewer.pal(9, "Blues"))

In [None]:
positive <- positive %>%
  count(word, sort = TRUE) %>%
  rename(freq = n)

head(positive, 21)

In [None]:
positive <- positive %>%
  # Rank explicitly by frequency
  top_n(21, freq)
colourCount = length(unique(positive$word))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))

# The 21 most frequent words in positive tweets
positive %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col(fill = getPalette(colourCount)) +
  coord_flip() 

This visualization shows that 'thanks', 'thank', 'great', 'love', 'good', 'best', and 'awesome' are among the most frequently used words in positive tweets.

**• The Most Frequent Words in Negative Sentiment**

In [None]:
negative <- tidy_dataset %>% 
  filter(airline_sentiment == "negative") 

negative <- negative %>%
  filter(!(word %in% list))

# Select the word column by name rather than by position
wordcloud(negative$word,
          max.words = 100,
          random.order=FALSE, 
          rot.per=0.30, 
          use.r.layout=FALSE, 
          # "Reds" provides at most 9 colours
          colors=brewer.pal(9, "Reds"))


In [None]:
negative <- negative %>%
  count(word, sort = TRUE) %>%
  rename(freq = n)

negative <- negative %>%
  # Rank explicitly by frequency
  top_n(21, freq)
colourCount = length(unique(negative$word))
getPalette = colorRampPalette(brewer.pal(8, "Dark2"))

# The 21 most frequent words in negative tweets
negative %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col(fill = getPalette(colourCount)) +
  coord_flip() 


'not', 'no', 'cancelled', 'help', 'but', 'customer', and 'time' are among the most frequently used words in negative tweets.

**Intersection of positive and negative words**

In [None]:
intersect(negative$word, positive$word)

**Top words that appear only in positive sentiment**

In [None]:
setdiff(positive$word, negative$word)

**Top words that appear only in negative sentiment**

In [None]:
setdiff(negative$word, positive$word)
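
The three comparisons above rely on base R set operations. A small sketch on made-up word lists (`pos_top` and `neg_top` are hypothetical) shows how they behave; note that `setdiff` is not symmetric, so the argument order decides which side's unique words are returned.

```r
# Made-up top-word lists, only to illustrate the set operations
pos_top <- c("thanks", "great", "flight", "love")
neg_top <- c("flight", "cancelled", "not", "help")

both     <- intersect(pos_top, neg_top)  # words present in both lists
only_pos <- setdiff(pos_top, neg_top)    # words only in the first list
only_neg <- setdiff(neg_top, pos_top)    # words only in the second list

print(both)
print(only_pos)
print(only_neg)
```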

**• What are the reasons for negative tweets?**

In [None]:
dataset %>%
  filter(negativereason != "") %>%
  ggplot(aes(x = negativereason)) + 
  geom_bar(fill = "tomato") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

This visualization shows that people complain mostly about customer service. Late flights are the next most common reason for complaints.

**• The Most Frequent Words in Neutral Sentiment**

In [None]:
neutral <- tidy_dataset %>% 
  filter(airline_sentiment == "neutral") 

neutral <- neutral %>%
  count(word, sort = TRUE) %>%
  rename(freq = n)

neutral <- neutral %>%
  filter(!(word %in% list))

head(neutral, 21)

In [None]:
neutral <- head(neutral, 21)

colourCount = length(unique(neutral$word))
getPalette = colorRampPalette(brewer.pal(12, "Set3"))


# The 21 most frequent words in neutral tweets
neutral %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col(fill = getPalette(colourCount)) +
  coord_flip() 

**• How many words for each sentiment?**

In [None]:
totals <- tidy_dataset %>%
  # Count by tweet id to find the word totals for tweet
  count(tweet_id) %>%
  # Rename the new column
  rename(total_words = n) 


totals <- dataset %>%
  inner_join(totals, by = "tweet_id") %>%
  select(tweet_id, total_words, airline_sentiment) %>%
  arrange(desc(total_words))

totals <- head(totals, 20)

ggplot(totals, aes(x = airline_sentiment , y = total_words, fill = airline_sentiment)) +
  geom_col() +
  scale_fill_brewer(palette="Paired")

To sum up, among the 20 longest tweets, negative ones contribute the most words, which suggests that people tend to write longer tweets when they encounter negative situations.
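
The chart above only covers the 20 longest tweets. A complementary check, sketched here on made-up numbers (the `toy_totals` data frame is hypothetical), compares the typical tweet length per sentiment across all tweets. Applied to the real per-tweet word counts, the same `group_by` and `summarise` calls would give a fuller picture.

```r
library(dplyr)

# Made-up per-tweet word counts, not taken from Tweets.csv
toy_totals <- data.frame(
  tweet_id = 1:6,
  airline_sentiment = c("negative", "negative", "negative",
                        "neutral", "positive", "positive"),
  total_words = c(24, 20, 22, 12, 15, 11),
  stringsAsFactors = FALSE
)

# Mean and median tweet length, plus the number of tweets, per sentiment
length_by_sentiment <- toy_totals %>%
  group_by(airline_sentiment) %>%
  summarise(mean_words = mean(total_words),
            median_words = median(total_words),
            tweets = n())

print(length_by_sentiment)
```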

**Thank you for reading! :)**