# Plotting with seaborn
Today you will be introduced to the `seaborn` library. Seaborn is a data visualization library for Python that builds on `matplotlib`. It works very well together with pandas DataFrames.

First, make sure to install seaborn and matplotlib:

```sh
pip install matplotlib
pip install seaborn
```

You can find more about seaborn here: https://seaborn.pydata.org/

In [None]:
# Some imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import collections

sns.set()

We will use a dataset containing tweets and sentiments about airlines from https://www.kaggle.com/crowdflower/twitter-airline-sentiment. First, load the data and have a look:

In [None]:
df = pd.read_csv('Tweets.csv')
df.tail()

It would be nice to have some more information about the data we are seeing. What is the label distribution? Seaborn comes with a `countplot` function that lets us visualize the sentiment labels in a single line of code:

In [None]:
sns.countplot(x='airline_sentiment', data=df)

The plot looks nice, but the default colors can be hard to tell apart for colorblind readers. Seaborn provides a `colorblind` palette for this:

In [None]:
sns.set_palette('colorblind')
sns.countplot(x='airline_sentiment', data=df)

Which airlines are covered? How many reviews do they have? Let's find out. This time, pass the column as `y`, so that the airlines appear on the y-axis and the counts are drawn as horizontal bars.

In [None]:
sns.countplot(y='airline', data=df)

So far, colors seem nice, but they do not give any additional information. Most plots in seaborn take a `hue` argument. This argument can be used to visualize another dimension via its coloring. Let's combine the information of the sentiment of a review together with the number of reviews per airline:

In [None]:
sns.countplot(y='airline', data=df, hue='airline_sentiment')
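
To check the numbers behind such a grouped count plot, the same counts can be tabulated with pandas. The following is a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from `Tweets.csv`); only the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "positive", "negative", "negative", "neutral"],
})

# One row per airline, one column per sentiment: each cell is the length
# of one bar in a count plot with y='airline' and hue='airline_sentiment'
counts = pd.crosstab(toy["airline"], toy["airline_sentiment"])
print(counts)
```

Combinations that never occur, such as a neutral tweet about United in this toy data, show up as 0 in the table.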

Now we have a much better impression of what the data looks like. We can count the instances of categorical variables. Since we are doing NLP, let's find out how many characters a review contains. A good way to visualize the distribution of characters per review would be a histogram.

In [None]:
# Compute character length
df['text_length'] = df['text'].str.len()
df.head()

In [None]:
# Plot a histogram
sns.histplot(data=df, x='text_length', binwidth=3)
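
To get a feeling for what a bin width of 3 means, the binning can be reproduced by hand with NumPy. This is only a sketch on made-up lengths; the way the bin edges are built here is an assumption for illustration, and seaborn may place its edges slightly differently.

```python
import numpy as np

# Made-up text lengths for illustration (not taken from the dataset)
lengths = np.array([12, 13, 14, 15, 16, 20, 139, 140, 140])
binwidth = 3

# Bin edges: start at the minimum and step by binwidth until the maximum is covered
edges = np.arange(lengths.min(), lengths.max() + binwidth, binwidth)

# counts[i] is the number of values in [edges[i], edges[i + 1]);
# only the last bin also includes its right edge
counts, edges = np.histogram(lengths, bins=edges)

print(edges[:4])   # first few bin edges
print(counts[:3])  # counts in the first three bins
```

A smaller bin width gives more, narrower bars and a noisier picture; a larger one smooths over details.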

The `binwidth` argument sets the width of each bin (bar) in units of the x-axis, here 3 characters. The y-axis shows how many instances fall into each bin. The distribution is strongly skewed, with a peak close to 140 characters, Twitter's character limit at the time the data was collected. Is that the same for reviews of each sentiment? We can again use the `hue` argument. Additionally, `kde=True` overlays a kernel density estimate, a smooth curve fitted to the distribution.

In [None]:
sns.histplot(data=df, x='text_length', binwidth=3, hue='airline_sentiment', kde=True)

It seems that people with a negative opinion tend to write as much as possible. The distribution of neutral and positive reviews is more uniform.

**Task:**
Look at the documentation and find out how else we can visualize the distribution of the character length while keeping the three sentiment classes separate, for example using colored stacked bars.

In [None]:
# TODO

Let us look at another function, the `lineplot`. We want to know how many tweets (of our very small dataset) were posted per day.

In [None]:
# First convert the `tweet_created` field into a datetime object ...
df['tweet_created'] = pd.to_datetime(df['tweet_created'])
# ... then extract the date (without the time information)
df['tweet_created_date'] = df['tweet_created'].dt.date
df.head()
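
The effect of the two conversion steps can be seen on a few hand-written timestamps. This is a small sketch; the timestamp format with a UTC offset is an assumption about what the `tweet_created` column looks like, and the values are made up.

```python
import pandas as pd

# Made-up timestamps; the format with a UTC offset is an assumption
toy = pd.DataFrame({
    "tweet_created": [
        "2015-02-24 11:35:52 -0800",
        "2015-02-24 23:59:10 -0800",
        "2015-02-23 08:00:00 -0800",
    ]
})

# Strings become timezone-aware timestamps ...
toy["tweet_created"] = pd.to_datetime(toy["tweet_created"])
# ... and .dt.date drops the time of day, keeping the local calendar date
toy["tweet_created_date"] = toy["tweet_created"].dt.date

print(toy)
```

The first two tweets now share the same date, so they end up in the same group when counting tweets per day.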

Prepare the data: let's keep the number of tweets per day, airline, and sentiment separately.

In [None]:
df_counts = df.groupby(by=['tweet_created_date', 'airline', 'airline_sentiment']).count().reset_index()
df_counts['number_tweets'] = df_counts['tweet_id']
df_counts = df_counts.loc[:, ['tweet_created_date', 'airline', 'airline_sentiment', 'number_tweets']]
df_counts.head()
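
A more compact way to obtain the same kind of count table is `groupby(...).size()`, which counts rows per group directly. The sketch below uses a made-up toy frame with the same column names; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "tweet_created_date": ["2015-02-17", "2015-02-17", "2015-02-17", "2015-02-18"],
    "airline": ["United", "United", "Delta", "United"],
    "airline_sentiment": ["negative", "negative", "positive", "neutral"],
})

# size() counts the rows in each group; reset_index turns the group keys
# back into columns and names the count column in one step
toy_counts = (
    toy.groupby(["tweet_created_date", "airline", "airline_sentiment"])
    .size()
    .reset_index(name="number_tweets")
)
print(toy_counts)
```

Note that only combinations that actually occur appear in the result, which is why the zero-filling step is still needed afterwards.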

In [None]:
# Not every combination of date, airline and sentiment has tweets.
# Fill in zeros for all remaining combinations where no such tweet occurred.

# Get all (unique) values
dates = list(set(df['tweet_created_date']))
airlines = list(set(df['airline']))
sentiments = list(set(df['airline_sentiment']))

# Collect the rows that must be added here
data_to_add = collections.defaultdict(list)

# Now go through every combination
for date in dates:
    for airline in airlines:
        for sentiment in sentiments:
            
            has_date = df_counts['tweet_created_date'] == date
            has_airline = df_counts['airline'] == airline
            has_sentiment = df_counts['airline_sentiment'] == sentiment
            
            # do we have an entry for this?
            if len(df_counts[has_date & has_airline & has_sentiment]) == 0:
                # If not add one entry with 0 tweets
                data_to_add['tweet_created_date'].append(date)
                data_to_add['airline'].append(airline)
                data_to_add['airline_sentiment'].append(sentiment)
                data_to_add['number_tweets'].append(0)
                
# Combine and sort values (ignore_index avoids duplicate index labels)
df_counts = pd.concat((df_counts, pd.DataFrame(data_to_add)), ignore_index=True)
df_counts = df_counts.sort_values(by=['tweet_created_date', 'airline', 'airline_sentiment'])
df_counts.head(20)
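
The nested loops make the idea explicit, but the same zero-filling can also be expressed without loops by reindexing against the full product of all keys. This is a sketch of that alternative on a made-up toy count table, not the approach used above; the names `keys`, `full_index` and `filled` are chosen for illustration.

```python
import pandas as pd

# Toy count table with only two of the eight possible combinations (made-up values)
toy_counts = pd.DataFrame({
    "tweet_created_date": ["2015-02-17", "2015-02-18"],
    "airline": ["United", "Delta"],
    "airline_sentiment": ["negative", "positive"],
    "number_tweets": [2, 1],
})
keys = ["tweet_created_date", "airline", "airline_sentiment"]

# Every combination of the unique dates, airlines and sentiments
full_index = pd.MultiIndex.from_product(
    [sorted(toy_counts[k].unique()) for k in keys], names=keys
)

# Reindexing adds the missing combinations and fills their count with 0
filled = (
    toy_counts.set_index(keys)
    .reindex(full_index, fill_value=0)
    .reset_index()
    .sort_values(by=keys)
)
print(filled)
```

This relies on each combination appearing at most once in the count table, which holds after the grouping step.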

For the sake of simplicity, let us only consider two airlines:

In [None]:
df_counts = df_counts[df_counts['airline'].isin(['US Airways', 'United'])]
df_counts.head(6)

Now call the `lineplot` function. This time we want color to encode the sentiment and, additionally, the line style to distinguish the two airlines. Luckily, we can do this via the `style` argument:

In [None]:
sns.lineplot(data=df_counts, x='tweet_created_date', y='number_tweets', style='airline', hue='airline_sentiment')
plt.xticks(rotation=45)
plt.show()

That's it. You can check out the Gallery (https://seaborn.pydata.org/examples/index.html) for more plots.