# Exploring Pandas' features through the TED Talks dataset

In this tutorial, we will use the [TED Talks dataset](https://www.kaggle.com/rounakbanik/ted-talks), available from Kaggle Datasets under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

This tutorial is partially based on the [Data Science Best Practices with pandas](https://github.com/justmarkham/pycon-2019-tutorial) tutorial presented by Kevin Markham at PyCon2019 on May 2, 2019.

## Import the required libraries

Sometimes you need to know the pandas version you are using, for example, when you need to consult the pandas documentation. You get the pandas version with:
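A minimal sketch of the import and the version check, assuming the conventional `pd` alias:

```python
import pandas as pd

# The installed version is exposed as a string on the package itself
print(pd.__version__)
```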

## Load and examine the TED talks dataset

### Check for the presence of missing values

Let's identify the speakers with a missing occupation and check whether their occupation can be recovered from other records (in case they gave more than one talk).

Compute the number of observations (talks) per speaker

Among all speakers, select those with a missing occupation

Unfortunately, there are no additional records that could be used to fill the missing occupation values.
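The check described above could be sketched as follows. The data frame is a made-up stand-in for the TED data (the speaker names and values are invented); only the column names `main_speaker` and `speaker_occupation` come from the tutorial.

```python
import pandas as pd

# Toy stand-in for the TED data; names and values are invented
talks = pd.DataFrame({
    "main_speaker": ["Ana", "Ben", "Ana", "Cy"],
    "speaker_occupation": ["Writer", None, "Writer", None],
})

# Number of missing values in each column
missing_per_column = talks.isna().sum()

# Number of talks per speaker
talks_per_speaker = talks["main_speaker"].value_counts()

# Speakers whose occupation is missing, together with their talk counts
no_occupation = talks.loc[talks["speaker_occupation"].isna(), "main_speaker"]
print(talks_per_speaker.loc[no_occupation.unique()])
```

A speaker with a missing occupation and a talk count above 1 would be worth inspecting, since another record might carry the occupation.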

### Transform filming date and publication date into datetime columns

Examine a sample of film_date and published_date values

Note that both dates are given as *Unix epoch time*, that is, the number of seconds that have elapsed since January 1, 1970. 

For more about epoch time, see: https://www.epochconverter.com/

Let's start by transforming the film_date into the datetime type

Then, do the same for the published_date
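As a rough sketch, the conversion can be done with `pd.to_datetime()` and `unit="s"`. The two epoch values below are made up for illustration.

```python
import pandas as pd

# Made-up epoch values, expressed in seconds
talks = pd.DataFrame({
    "film_date": [1140825600, 1340668800],
    "published_date": [1151367060, 1340722800],
})

# unit="s" tells pandas that the integers are seconds since the epoch
talks["film_date"] = pd.to_datetime(talks["film_date"], unit="s")
talks["published_date"] = pd.to_datetime(talks["published_date"], unit="s")

print(talks.dtypes)
```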

### Drop columns that are no longer needed

### How to change columns' names? 

The most flexible method for renaming columns is the `rename()` method. One should pass it a dictionary in which the keys are the old column names, while the values are the new names, and specify the axis to be 'columns'.

For example, rename columns 'comments' and 'views' to 'comment_count' and 'view_count', respectively
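A small sketch of that renaming, run on a toy data frame with invented values:

```python
import pandas as pd

talks = pd.DataFrame({"comments": [10, 20], "views": [1000, 2000]})  # toy data

# Keys are the old column names, values are the new ones
talks = talks.rename(
    {"comments": "comment_count", "views": "view_count"}, axis="columns"
)
print(talks.columns)
```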

In case you may need that, you can get column names as a list, as follows:
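One way to do it, shown on a toy data frame:

```python
import pandas as pd

talks = pd.DataFrame({"comment_count": [10], "view_count": [1000]})  # toy data

# The columns attribute is an Index; tolist() turns it into a plain list
column_names = talks.columns.tolist()
print(column_names)
```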

## Task 1: Compute and plot the number of talks that took place each year

To do this, we need to extract the year from the filming date and group the talks by year

First, add the event_year column

Then, compute the number of talks per year

Note that the counts are by default sorted in descending order of count value. This is fine if we are interested in identifying years with the highest / lowest number of talks. <br>
However, if we want to plot the number of talks per year, we need the talk counts ordered based on the year. To get that, we can do the sorting based on the index:
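The difference between the two orderings can be sketched on a toy data frame (the dates are invented; `film_date` is assumed to be a datetime column already):

```python
import pandas as pd

talks = pd.DataFrame({
    "film_date": pd.to_datetime(["2010-03-01", "2009-07-15", "2010-11-20"]),
})

# Extract the year through the .dt accessor
talks["event_year"] = talks["film_date"].dt.year

# Default ordering: most frequent year first
counts_by_frequency = talks["event_year"].value_counts()

# Chronological ordering: sort on the index, which holds the years
counts_by_year = counts_by_frequency.sort_index()
print(counts_by_year)
```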

Now, we can plot talk counts across years

## Task 2: Which TED events attracted the most attention?

Consider the number of views as a proxy of an event's attractiveness

Compute average number of views per talk during each event

It might be the case that some of these events got high mean views due to having a small number of very popular talks, or even just one very popular talk. So, consider also the number of talks at each event.

To aggregate data based on more than one function (e.g., in this case, mean and count), we can follow `groupby()` with the `agg()` function that receives a list of aggregation functions we want to apply to the grouped data.

Let's store the results in a new data frame

Now, we can examine, for each event, both the average number of views and number of talks
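A minimal sketch of this aggregation, using invented events and view counts, and assuming the views column has been renamed to `view_count`:

```python
import pandas as pd

talks = pd.DataFrame({
    "event": ["TED2010", "TED2010", "TEDx Demo", "TED2010"],
    "view_count": [100, 300, 5000, 200],
})

# agg() with a list of function names yields one column per function
event_stats = talks.groupby("event")["view_count"].agg(["mean", "count"])

# Events ordered by their average number of views per talk
top_events = event_stats.sort_values("mean", ascending=False)
print(top_events)
```

In this toy data the event with the highest mean has a single talk, which is the situation the `count` column is meant to expose.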

Note that all top 10 events (based on the average views) had at most 2 talks. <br>
Let's check the stats for the number of talks and mean views per event

To get some further insights, let's consider only events with an above-average number of talks. Since the distribution is highly skewed, we'll use the median rather than the mean as the average value
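Such a filter might look like the sketch below. The `event_stats` frame and its values are invented; it only mimics the shape of a per-event summary with `mean` and `count` columns.

```python
import pandas as pd

event_stats = pd.DataFrame(
    {"mean": [5000.0, 200.0, 350.0, 900.0], "count": [1, 30, 45, 2]},
    index=["Event A", "Event B", "Event C", "Event D"],
)

# The median is robust to a few events with very many (or very few) talks
median_count = event_stats["count"].median()

# Keep events with more talks than the median, ordered by mean views
busy_events = (
    event_stats[event_stats["count"] > median_count]
    .sort_values("mean", ascending=False)
)
print(busy_events)
```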

## Task 3: Explore talks based on their ratings

Take a closer look at a sample of ratings

Even though a 'ratings' value looks like a list, it is not

To convert talk ratings into a list - so that they can be further processed - we will use a function from Python's `ast` (*Abstract Syntax Tree*) module:

The `literal_eval()` function evaluates a string containing a Python literal or container, that is, it can be used to transform a string into a literal value, a list, a tuple or any other container object

Create a new column for storing ratings as a list, instead of a string.

To that end, we will use the `apply()` method to apply the `ast.literal_eval()` function to each value of the ratings column
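A short sketch of the conversion follows. The ratings string is invented and assumes that each rating is written as a dictionary with `id`, `name`, and `count` keys.

```python
import ast
import pandas as pd

# One made-up talk whose ratings are stored as the string form of a list
talks = pd.DataFrame({
    "ratings": [
        "[{'id': 7, 'name': 'Funny', 'count': 12}, "
        "{'id': 10, 'name': 'Inspiring', 'count': 30}]"
    ],
})
print(type(talks["ratings"].iloc[0]))  # a str, not a list

# literal_eval parses the string into real Python objects, element by element
talks["ratings_list"] = talks["ratings"].apply(ast.literal_eval)
print(type(talks["ratings_list"].iloc[0]))
```

Unlike `eval()`, `literal_eval()` accepts only literals and containers, so it does not execute arbitrary code found in the string.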

### Task 3.1: For each talk, find the 3 most frequent ratings

Add a column with a tuple comprising names of the 3 most frequent ratings for the corresponding talk

One way to approach this task is to create a function that receives a list of ratings for one talk and returns a tuple with the 3 most frequent ratings

Now, apply the function to the 'ratings_list' of each talk to create a new column (e.g. top3_ratings)
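One possible version of such a function is sketched below. The function name `top3_ratings` and the sample ratings are illustrative, and each rating is assumed to be a dictionary with `name` and `count` keys.

```python
import pandas as pd

def top3_ratings(ratings_list):
    """Return the names of the 3 ratings with the highest counts."""
    ranked = sorted(ratings_list, key=lambda r: r["count"], reverse=True)
    return tuple(r["name"] for r in ranked[:3])

talks = pd.DataFrame({"ratings_list": [[
    {"name": "Funny", "count": 12},
    {"name": "Inspiring", "count": 30},
    {"name": "Longwinded", "count": 2},
    {"name": "Informative", "count": 25},
]]})

# apply() calls the function once per talk
talks["top3_ratings"] = talks["ratings_list"].apply(top3_ratings)
print(talks["top3_ratings"].iloc[0])
```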

Print 'top3_ratings' of the 10 most viewed talks to see what the primary ways of rating those talks were

### Task 3.2: Which TED events had the most 'Jaw-dropping' talks?

This task can be interpreted in different ways. One way is to qualify a talk as *jaw-dropping* if 'Jaw-dropping' ratings were among the 3 most frequent kinds of ratings for that talk.

Let's start by taking a subset of talks that are *jaw-dropping*, as qualified above

Note the selection based on a condition expressed as a function of column (top3_ratings) values

Take a sample of jaw-dropping talks and check that 'Jaw-dropping' is really among the top 3 ratings

To answer the posed question, we can now simply compute the number of talks per event in the newly created data frame with jaw-dropping talks
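The selection and the per-event count could be sketched like this, on invented talks whose `top3_ratings` column already holds tuples of rating names:

```python
import pandas as pd

talks = pd.DataFrame({
    "event": ["TED2010", "TED2011", "TED2010"],
    "top3_ratings": [
        ("Jaw-dropping", "Funny", "Inspiring"),
        ("Informative", "Inspiring", "Funny"),
        ("Inspiring", "Jaw-dropping", "Beautiful"),
    ],
})

# Boolean mask computed by a function of the column values
is_jaw_dropping = talks["top3_ratings"].apply(lambda top3: "Jaw-dropping" in top3)
jaw_dropping_talks = talks[is_jaw_dropping]

# Number of jaw-dropping talks per event
talks_per_event = jaw_dropping_talks["event"].value_counts()
print(talks_per_event)
```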

### Task 3.3: Compute the number of positive, negative, and neutral ratings, then add one column for each of these counts

This practically means that we should add 3 columns - pos_ratings, neg_ratings, neutral_ratings - with the corresponding rating counts.

We will achieve this through a multistep process:

1) Identify different kinds of rating categories that have been used to characterise talks and classify them as positive, negative, or neutral

2) Create a function for computing the number of ratings in each of the 3 classes (positive, negative, neutral)

3) Add a column - ratings_counts - storing the computed values as tuples of the form (pos_ratings, neg_ratings, neutral_ratings)

4) Transform the ratings_counts column into 3 columns: pos_ratings, neg_ratings, neutral_ratings 

**Step 1.1**: Identify different kinds of rating categories for each talk and store them as a list in a new, auxiliary column (e.g. rating_categories)

Next, we need to identify unique rating categories across the categories lists of all talks
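A sketch of Step 1.1 and the follow-up, using two invented talks and assuming each rating is a dictionary with `name` and `count` keys:

```python
import pandas as pd

talks = pd.DataFrame({"ratings_list": [
    [{"name": "Funny", "count": 3}, {"name": "Inspiring", "count": 9}],
    [{"name": "Inspiring", "count": 4}, {"name": "Obnoxious", "count": 1}],
]})

# Auxiliary column: the category names used for each talk
talks["rating_categories"] = talks["ratings_list"].apply(
    lambda ratings: [r["name"] for r in ratings]
)

# A set keeps each category once, however many talks use it
unique_categories = set()
for categories in talks["rating_categories"]:
    unique_categories.update(categories)
print(sorted(unique_categories))
```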

**Step 1.2**: Classify rating categories as positive, negative, or neutral

**Step 2**: Create a function that computes the number of ratings in each of the 3 classes (positive, negative, neutral)

**Step 3**: Add a column - *ratings_counts* - storing the output of the above function applied to the ratings_list column
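Steps 1.2 to 3 might be sketched as below. The split into positive and negative categories is a hypothetical example, not the classification used in the tutorial, and `count_by_class` is an illustrative name.

```python
import pandas as pd

# Hypothetical classification; anything not listed is treated as neutral
POSITIVE = {"Funny", "Inspiring", "Informative"}
NEGATIVE = {"Obnoxious", "Longwinded"}

def count_by_class(ratings_list):
    """Return (pos_ratings, neg_ratings, neutral_ratings) for one talk."""
    pos = neg = neutral = 0
    for rating in ratings_list:
        if rating["name"] in POSITIVE:
            pos += rating["count"]
        elif rating["name"] in NEGATIVE:
            neg += rating["count"]
        else:
            neutral += rating["count"]
    return pos, neg, neutral

talks = pd.DataFrame({"ratings_list": [[
    {"name": "Funny", "count": 3},
    {"name": "Obnoxious", "count": 1},
    {"name": "OK", "count": 5},
]]})
talks["ratings_counts"] = talks["ratings_list"].apply(count_by_class)
print(talks["ratings_counts"].iloc[0])
```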

**Step 4.1**: Transform the *ratings_counts* column into a data frame with 3 columns: pos_ratings, neg_ratings, neutral_ratings

To that end, we can `apply()` the Series constructor on the *ratings_counts* column

**Step 4.2**: Merge this new data frame with the original one (ted); this can be done using the `concat()` function. The auxiliary column - ratings_counts - can be dropped as it is no longer needed
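Steps 4.1 and 4.2 can be sketched on a toy `ted` data frame with invented counts:

```python
import pandas as pd

ted = pd.DataFrame({
    "title": ["Talk A", "Talk B"],
    "ratings_counts": [(30, 2, 5), (12, 7, 1)],
})

# Applying the Series constructor expands each tuple into its own columns
counts_df = ted["ratings_counts"].apply(pd.Series)
counts_df.columns = ["pos_ratings", "neg_ratings", "neutral_ratings"]

# Join column-wise (axis=1) and drop the auxiliary column
ted = pd.concat([ted.drop(columns="ratings_counts"), counts_df], axis=1)
print(ted)
```

Both frames share the same row index, which is what lets `concat()` line the rows up correctly.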

### Task 3.4: Find speakers with the highest average number of positive ratings per talk

We need to group talks based on the 'main_speaker' column and, for each group, compute mean value of the pos_ratings column

To answer this more thoroughly, we'll consider also the number of talks that a speaker gave

Identify the top 10 speakers among those who gave more than 1 talk
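A sketch of this grouping and filtering, with invented speakers and counts:

```python
import pandas as pd

ted = pd.DataFrame({
    "main_speaker": ["Ana", "Ana", "Ben", "Cy", "Cy"],
    "pos_ratings": [100, 300, 900, 50, 150],
})

# Mean positive ratings and number of talks per speaker
speaker_stats = ted.groupby("main_speaker")["pos_ratings"].agg(["mean", "count"])

# Only speakers with more than one talk, best average first
top_speakers = (
    speaker_stats[speaker_stats["count"] > 1]
    .sort_values("mean", ascending=False)
    .head(10)
)
print(top_speakers)
```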

### Task 3.5: Identify 10 speakers with the largest proportion of negative ratings

Start by computing the proportion of negative ratings for all the talks

Group talks by the main_speaker and, for each group (that is, speaker), compute the average proportion of negative ratings and number of talks

Maybe those with one talk just had a bad day or were not experienced enough... 
So, let's focus on speakers with 3+ talks
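The steps above might look as follows. The data are invented, `neg_ratings_prop` is an illustrative column name, and the proportion is assumed to be negative ratings divided by all ratings of the talk.

```python
import pandas as pd

ted = pd.DataFrame({
    "main_speaker": ["Ana", "Ana", "Ana", "Ben"],
    "pos_ratings": [80, 60, 90, 10],
    "neg_ratings": [10, 30, 5, 80],
    "neutral_ratings": [10, 10, 5, 10],
})

# Proportion of negative ratings per talk
total = ted[["pos_ratings", "neg_ratings", "neutral_ratings"]].sum(axis=1)
ted["neg_ratings_prop"] = ted["neg_ratings"] / total

# Average proportion and number of talks per speaker
neg_stats = ted.groupby("main_speaker")["neg_ratings_prop"].agg(["mean", "count"])

# Speakers with 3 or more talks, largest average proportion first
frequent = (
    neg_stats[neg_stats["count"] >= 3]
    .sort_values("mean", ascending=False)
    .head(10)
)
print(frequent)
```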

### Task 3.6: Which occupations deliver the funniest TED talks on average?

(Consider also how well represented occupations are)

We'll start by computing the number of Funny ratings per talk and storing the results in the (new) column 'funny_ratings'

This may be fine, but absolute values tend to be misleading - it is often better to take relative values, that is, proportions. <br>
Therefore, instead of treating the talks with the highest absolute number of Funny ratings as the funniest, it is better to choose the talks with the highest proportion of Funny ratings.
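A sketch of both measures on two invented talks. The helper names are illustrative, and the proportion is assumed to be Funny ratings divided by all ratings of the talk.

```python
import pandas as pd

def funny_count(ratings_list):
    """Sum the counts of the ratings named 'Funny'."""
    return sum(r["count"] for r in ratings_list if r["name"] == "Funny")

def total_count(ratings_list):
    """Sum the counts over all rating categories."""
    return sum(r["count"] for r in ratings_list)

ted = pd.DataFrame({"ratings_list": [
    [{"name": "Funny", "count": 20}, {"name": "Inspiring", "count": 80}],
    [{"name": "Funny", "count": 5}, {"name": "Inspiring", "count": 5}],
]})

ted["funny_ratings"] = ted["ratings_list"].apply(funny_count)
ted["funny_ratings_prop"] = (
    ted["funny_ratings"] / ted["ratings_list"].apply(total_count)
)
print(ted[["funny_ratings", "funny_ratings_prop"]])
```

Here the first talk wins on the absolute count while the second wins on the proportion, which is the kind of disagreement the two criteria can produce.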

Let's examine absolute and relative counts of the top 10 funniest talks based on the absolute counts. <br>(the idea is to see the difference of absolute vs relative counts)

Next, do the grouping based on the speaker_occupation and compute, for each group (i.e., each occupation), average counts and proportions of Funny ratings

Take the top 10 occupations based on the average count of Funny ratings

Now, do the same, but using average proportion of Funny ratings as the criterion

It seems that relative values serve as the better criterion for selecting occupations that delivered the funniest talks. So, we'll continue with the funny_ratings_prop values.

Note that some occupations sound rather exotic and unique (e.g. "Gentleman thief", "Science humorist"). Let's also consider the frequency of different occupations

Note that many occupations seem to appear only once. To verify this, let's compute the proportion of infrequent occupations

We cannot make any conclusion about an occupation based on just one representative. So, let's limit our analysis to those occupations that have at least a few (e.g. five) representatives. To that end, select those with frequency >= 5
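This selection could be sketched as follows, on invented data where two occupations reach the threshold of 5 and one does not:

```python
import pandas as pd

ted = pd.DataFrame({
    "speaker_occupation": ["Writer"] * 5 + ["Comedian"] * 6 + ["Gentleman thief"],
    "funny_ratings_prop": [0.1] * 5 + [0.5] * 6 + [0.9],
})

# Frequency of each occupation
occupation_counts = ted["speaker_occupation"].value_counts()

# Names of occupations with at least 5 representatives
well_represented = occupation_counts[occupation_counts >= 5].index

# Keep only talks by speakers with such occupations
ted_top_occupations = ted[ted["speaker_occupation"].isin(well_represented)]

# Average proportion of Funny ratings per occupation, funniest first
funny_by_occupation = (
    ted_top_occupations.groupby("speaker_occupation")["funny_ratings_prop"]
    .mean()
    .sort_values(ascending=False)
)
print(funny_by_occupation)
```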

Let's see how many occupations we have selected as "well represented"

They form only a tiny portion (5%) of all occupations.

Finally, let's examine how funny the talks by the representatives of such occupations are.

As expected, comedians gave the funniest talks. On the other hand, talks by physicians and surgeons, again, as expected, are the least funny. <br>
Something probably unexpected: data scientists got 5th place (among 68 occupations)

## Task 4: Examine the topics of the top 100 'Inspiring' talks and present them in a tag cloud

While talks can be considered the best based on a variety of criteria, we will value and rank them based on the proportion of positive ratings 

Let's start by creating a subset of talks that were rated as 'Inspiring' 

Next, order these talks based on the proportion of positive ratings and take top 100

To be able to access rows of this new data frame using regular indices (0,1,2,...), we need to reset its index
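A sketch of the ordering and the index reset, on an invented frame with a non-sequential index. The column name `pos_ratings_prop` is an assumption, and `head(2)` stands in for `head(100)`.

```python
import pandas as pd

inspiring = pd.DataFrame(
    {"title": ["A", "B", "C"], "pos_ratings_prop": [0.6, 0.9, 0.75]},
    index=[17, 42, 8],  # leftover labels from the original data frame
)

top_inspiring = inspiring.sort_values("pos_ratings_prop", ascending=False).head(2)

# drop=True discards the old labels instead of keeping them as a column
top_inspiring = top_inspiring.reset_index(drop=True)
print(top_inspiring)
```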

Next, let's examine tags associated with the inspiring talks

It seems that tags of a particular talk are stored as a list. But, we should take a closer look...

Tags are, in fact, encoded as a string...  <br>
So, we (again) have to use the `ast.literal_eval()` function to get a list out of a string (representation of the list)

Next, we will create a dictionary of the tags that were used to describe the inspiring talks. Keys in this dictionary will be individual tags, while values will be frequencies of tags' occurrences in relation to the inspiring talks. We need this type of dictionary for the creation of a tag cloud.
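One way to build such a dictionary is with `collections.Counter`, sketched here on two invented talks whose tags are stored as strings:

```python
import ast
from collections import Counter

import pandas as pd

top_inspiring = pd.DataFrame({
    "tags": ["['science', 'culture']", "['culture', 'design']"],
})

# Turn each string into a real list of tags
tag_lists = top_inspiring["tags"].apply(ast.literal_eval)

# Count how often each tag occurs across all talks
tag_counter = Counter()
for tags in tag_lists:
    tag_counter.update(tags)

tag_freq = dict(tag_counter)
print(tag_freq)
```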

How many unique tags were identified?

How frequent are those tags? Compute some basic statistics that describe the tag frequency distribution

Keep only tags with above average (median) frequency
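A possible sketch, assuming the tag frequencies are held in a dictionary named `tag_freq` (the values here are invented):

```python
import pandas as pd

tag_freq = {"science": 1, "culture": 8, "design": 3, "music": 5}

# A Series makes the descriptive statistics and the filtering convenient
freq_series = pd.Series(tag_freq)
print(freq_series.describe())

median_freq = freq_series.median()

# Keep tags whose frequency exceeds the median, most frequent first
frequent_tags = freq_series[freq_series > median_freq].sort_values(ascending=False)
frequent_tags_dict = frequent_tags.to_dict()
print(frequent_tags.head(15))
```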

Print 15 most frequent tags (and their frequencies)

Now, we can create a word cloud

Some useful materials for word cloud: <br>
https://www.datacamp.com/community/tutorials/wordcloud-python <br>
https://gist.github.com/izikeros/fca85e2d7b9eae3e0d9dec6a1f1635b3