<a href="https://www.kaggle.com/code/julioam/ted-talk-explore?scriptVersionId=163104485" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

In [None]:
#import data
df = pd.read_csv("../input/ted-talks/data.csv")

In [None]:
print(df.head())

In [None]:
#info of dataset
print(df.info())

The data consists of integer and object (string) columns.

# Data Cleaning

## Missing values

In [None]:
df.isnull().sum()/len(df)*100

The `author` column has missing values. Since the percentage is very low, the rows where the author is missing will be removed.

In [None]:
df2 = df.dropna(axis=0)

In [None]:
#Check new dataset
df2.info()

There are no missing values in the dataset anymore, so we can continue to the analysis.
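
The cleaning step above drops a row if any column is missing. As a minimal sketch, assuming only a missing `author` should trigger removal, the check can be restricted with `subset`. The toy frame below is made up for illustration and is not the TED data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C"],
    "author": ["Ann", np.nan, "Cy"],
    "views": [100, 200, 300],
})

# Drop only the rows whose author is missing
toy_clean = toy.dropna(subset=["author"])

# Share of missing values per column, in percent
missing_pct = toy_clean.isnull().mean() * 100
print(missing_pct)
```

`isnull().mean() * 100` gives the same percentages as `isnull().sum()/len(df)*100`.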

# Data Analysis

## Finding the most popular TED Talks

To answer this question, I will look at the TED Talk titles with the highest views.
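
Ranking by views has to consider every row, not just the first ten. A minimal sketch of that idea, using a made-up toy frame (the column names `title` and `views` match the dataset, the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C", "Talk D"],
    "views": [1200, 56000, 300, 9800],
})

# nlargest looks at every row and returns the top n, sorted by views descending
top2 = toy.nlargest(2, "views").reset_index(drop=True)
print(top2)
```

`nlargest(n, column)` is one way to do this; sorting with `sort_values` and then calling `head(n)` gives the same rows.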

In [None]:
#Sort the whole dataset by views, then keep only the top 10
popular_tedx = df.loc[:, ['title','views']].sort_values(by='views', ascending=False, ignore_index=True).head(10)
print(popular_tedx)

fig=plt.figure(figsize=(10,8))
plt.barh(width='views', y='title', data=popular_tedx, color = 'darkred', edgecolor='darkred')
plt.gca().invert_yaxis()  #put the most viewed talk at the top
plt.ylabel('Title of the talk')
plt.xlabel('Total views')
plt.title('Top 10 Most Popular TED Talks')
plt.show()

## Finding the most popular TED talks Speaker (in terms of number of talks)

To answer this question, I will count the number of talks for each speaker.

In [None]:
#Top 10 speakers with the most talks
#plt.figure(figsize=(8,6))
top_speakers = df.loc[:,'author'].value_counts().iloc[:10].index  #iloc[:10] keeps 10 speakers
sns.countplot(y = 'author', data = df, color = 'darkred', edgecolor = 'darkred', order = top_speakers)
#Note: this changes the axes background only for figures created after this point
sns.set(rc={'axes.facecolor':'black'})
plt.xlabel('Number of talks')
plt.title('Top 10 TED Speakers by Number of Talks')
plt.show()

## Month-wise Analysis of TED talk frequency
I will look at the number of TED Talks per month.

In [None]:
#Convert the date column to string
df3 = df2.copy()
df3['date'] = df3['date'].astype('string')

#Split the date into month and year
df3[['Month', 'Year']] = df3['date'].str.split(' ', expand=True)
df3 = df3.drop(columns='date')

#Number of talk per month
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']
plt.figure(figsize=(14,6))
fig_3 = sns.countplot(x = 'Month', data = df3, color = 'darkred', edgecolor = 'darkred',
            order = month_order)
plt.ylabel('Frequency (Number of Talks)')
plt.title('Talk frequency per Month')
plt.show()

February is the month with the most talks, while January is the month with the fewest.
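
Splitting the date on a space works when every value has exactly two parts. As an alternative sketch, assuming the dates look like "February 2020" and the environment uses English month names, `pd.to_datetime` with an explicit format yields the month and year directly. The toy values are made up.

```python
import pandas as pd

# Hypothetical toy data; assumes dates are formatted like "February 2020"
toy = pd.DataFrame({"date": ["February 2020", "January 2019", "February 2018"]})

# %B is the full month name, %Y the four-digit year
parsed = pd.to_datetime(toy["date"], format="%B %Y")
toy["Month"] = parsed.dt.month_name()
toy["Year"] = parsed.dt.year

# Number of talks per month
month_counts = toy["Month"].value_counts()
print(month_counts)
```

A value that does not match the format raises an error here, which makes malformed dates visible instead of silently producing odd month names.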

## Year-wise Analysis of TED talk frequency
I will look at the number of TED Talks per year.

In [None]:
#Number of talks per year (groupby sorts by Year and names the count column explicitly)
num_talk_year = df3.groupby('Year').size().reset_index(name='count')

plt.figure(figsize=(11,5))
fig_4 = sns.lineplot(x = 'Year', y='count', data = num_talk_year, color = 'darkred')
plt.ylabel('Frequency (Number of Talks)')
plt.xticks(rotation=45)
plt.title('Talk frequency per Year')
plt.show()

Overall, the number of talks increases from year to year. 2019 is the year with the most talks, while the years from 1970 to the early 1990s have the fewest.
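
Years without any talk do not appear in the counts, so a line plot connects the neighbouring years directly. A minimal sketch, assuming the year is stored as a string of digits, converts it to an integer and fills the gaps with zero. The toy values are made up.

```python
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({"Year": ["2018", "2020", "2020", "2017"]})

# Convert the year strings to integers so they sort and plot numerically
years = toy["Year"].astype(int)

# Count talks per year, ordered by year
per_year = years.value_counts().sort_index()

# Add the years that have no talks, with a count of 0
first, last = int(per_year.index.min()), int(per_year.index.max())
per_year = per_year.reindex(list(range(first, last + 1)), fill_value=0)
print(per_year)
```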

## Finding TED talks of your favorite Author
Honestly, I have only watched a few TED Talks by various speakers and I don't have a favorite, so I will choose the author with the most talks.

In [None]:
#Select the titles of all talks by the chosen author
fav_author = df3.loc[df3['author'] == 'Alex Gendler', 'title']

print(fav_author[:10])

## Finding TED talks with the best view to like ratio
To find the ratio, I will divide views by likes. Note that a higher ratio means fewer likes per view.

In [None]:
#Defining ratio
df3['ratio'] = df3['views']/df3['likes']

#Top 10 view to like ratio
top_ratio = df3.loc[:, ['title', 'ratio']].sort_values(by='ratio', ascending=False, ignore_index=True)
plt.figure(figsize=(12,6))
plt.barh(width='ratio', y='title', data=top_ratio[:10], color = 'darkred', edgecolor='darkred')
plt.ylabel('Title of the talk')
plt.xlabel('View to like ratio')
plt.title('Top 10 Talks with Highest Ratio')
plt.show()

Most of the view-to-like ratios are around 33-35.
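
The direction of the ratio matters when reading it: views divided by likes grows when a talk gets fewer likes. The sketch below, on made-up toy data, computes both directions and guards against a zero like count (whether the real data contains zeros is an assumption, not something checked here).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C"],
    "views": [1000, 5000, 200],
    "likes": [50, 100, 0],
})

# Views per like; a zero like count becomes NaN instead of infinity
toy["views_per_like"] = toy["views"] / toy["likes"].replace(0, np.nan)

# Likes per 100 views: higher means a larger share of viewers liked the talk
toy["likes_per_100_views"] = 100 * toy["likes"] / toy["views"]

ranked = toy.sort_values("likes_per_100_views", ascending=False, ignore_index=True)
print(ranked)
```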

## Finding the most popular TED talks Speaker (in terms of number of views)
From the second question, we know that some speakers have more than one talk, so I will group the dataset by author and sum the views for each speaker.

In [None]:
#Find the total views for every speaker (sum only the views column)
total_views = df3.groupby('author')['views'].sum().sort_values(ascending=False).reset_index()
total_views_top_10 = total_views.head(10)

#Top 10 speakers with the most views
plt.figure(figsize=(8,6))
plt.barh(width='views', y='author', data=total_views_top_10, color = 'darkred', edgecolor='darkred')
plt.xlabel('Total views')
plt.title('Top 10 Most Popular TED Speakers')
plt.show()

Alex Gendler is the most popular TED Talk speaker by total views.
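
Total views favour speakers with many talks. As a sketch on made-up toy data, named aggregation can compute the total views and the talk count in one pass, which also gives views per talk. The names `total_views`, `talks` and `views_per_talk` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({
    "author": ["Ann", "Bo", "Ann", "Cy"],
    "views": [100, 250, 300, 50],
})

# One row per author with the summed views and the number of talks
summary = (
    toy.groupby("author")
       .agg(total_views=("views", "sum"), talks=("views", "count"))
       .sort_values("total_views", ascending=False)
       .reset_index()
)

# Average views per talk for each author
summary["views_per_talk"] = summary["total_views"] / summary["talks"]
print(summary)
```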

## Finding TED talks based on tags (like climate)

For the tag, I will search the titles of the TED Talks (this could be improved with other methods).
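
One possible improvement is to match the tag as a whole word and ignore case, so that "Climate" is found and "transport" does not count as "sport". The helper `talks_with_tag` below is hypothetical and the titles are made up; it is a sketch, not the only way to do this.

```python
import re
import pandas as pd

# Hypothetical toy data; not the real TED dataset
toy = pd.DataFrame({"title": [
    "Climate change is real",
    "The future of transport",
    "Why sport matters",
    "A climate for sports fans",
]})

def talks_with_tag(frame, tag):
    """Hypothetical helper: rows whose title contains `tag` as a whole word."""
    # \b marks a word boundary; the optional s also accepts a simple plural
    pattern = rf"\b{re.escape(tag)}s?\b"
    mask = frame["title"].str.contains(pattern, case=False, regex=True, na=False)
    return frame[mask]

sport_talks = talks_with_tag(toy, "sport")
climate_talks = talks_with_tag(toy, "climate")
print(sport_talks)
print(climate_talks)
```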

In [None]:
#climate tag
tag_df = df3[df3['title'].str.contains('climate', case=False)]  #case=False also matches "Climate"
print(tag_df.iloc[:, :5])

In [None]:
#sport tag
sport_df = df3[df3['title'].str.contains('sport', case=False)]  #case=False also matches "Sport"
print(sport_df.iloc[:, :5])