<a href="https://colab.research.google.com/github/nisha432/ted-talk-view-prediction/blob/main/TED_Talk_View_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - TED Talk View Prediction  



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


TED is dedicated to researching and sharing knowledge that matters through short talks and presentations. Their goal is to inform and educate global audiences in an accessible way.
The main objective of this project is to build a predictive model that predicts the number of views of videos uploaded to the TEDx website, so that TED can bring more talks on the topics that attracted a good number of views.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was only an alias of the standard library module and is removed in NumPy 2.0

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as snb

from scipy import stats
import missingno as mn




### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/AlmaBetter/data_ted_talks.csv')

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

The given dataset has 4005 rows and 19 columns.

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Group sizes per talk; any size greater than 1 indicates a duplicated talk
df=dataset.pivot_table(index=['talk_id','title','speaker_1','all_speakers'],aggfunc='size')
df
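
A more direct duplicate check can be sketched as follows, assuming a small set of key columns identifies a talk. The `talks` frame and its values are made up for illustration and stand in for the TED data.

```python
import pandas as pd

# Toy frame standing in for the TED data (values are made up)
talks = pd.DataFrame({
    "talk_id": [1, 2, 2, 3],
    "title": ["A", "B", "B", "C"],
    "speaker_1": ["x", "y", "y", "z"],
})

# Columns assumed to identify a talk
key_columns = ["talk_id", "title", "speaker_1"]

# duplicated() flags every row that repeats an earlier row on the key columns
duplicate_count = talks.duplicated(subset=key_columns).sum()
print("Duplicate rows:", duplicate_count)
```

A count of zero on the real dataset would mean that no talk appears more than once.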

#### Missing Values/Null Values

In [None]:
def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    return summary
result = resumetable(dataset)
result.sort_values('Missing', ascending= False)


In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())


In [None]:
# Visualizing the missing values

In [None]:
plt.figure(figsize=(10,6))
snb.heatmap(dataset.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})
plt.savefig("visualizing_missing_data_with_heatmap_Seaborn_Python.png", dpi=100)

In [None]:
# displot is a figure-level function that creates its own figure, so no plt.figure() call is needed
snb.displot(
    data=dataset.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)
plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)

In [None]:
# Total number of missing values across the whole dataset
print(dataset.isnull().values.sum())

### What did you know about your dataset?

The given dataset has 4005 rows and 19 columns. Several columns, such as comments, occupations, about_speakers, all_speakers, and recorded_date, have missing values.
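
The share of missing values per column is often easier to read than raw counts. The sketch below shows one way to compute it; the `sample` frame is hypothetical and only mimics a few of the columns named above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking a few TED columns (values are made up)
sample = pd.DataFrame({
    "comments": [10, np.nan, 30, np.nan],
    "occupations": ["writer", None, "artist", "poet"],
    "views": [100, 200, 300, 400],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

# Keep only the columns that actually have missing values
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

Applied to the real data, the same steps would use `dataset` in place of `sample`.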

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns=list(dataset.columns)
columns

In [None]:
# Dataset Describe 
# By default, describe() returns statistics for numeric columns only.


dataset.describe()

In [None]:
#Including only string columns in a DataFrame description.
dataset.describe(include='object')


In [None]:
# Describing all columns of a DataFrame regardless of data type.
dataset.describe(include='all')

### Variables Description 

talk_id: Talk identification number provided by TED

title: Title of the talk

speaker_1: First speaker in TED's speaker list

all_speakers: Speakers in the talk

occupations: Occupations of the speakers

about_speakers: Blurb about each speaker

views: Count of views

recorded_date: Date the talk was recorded

published_date: Date the talk was published to TED.com

event: Event or medium in which the talk was given

native_lang: Language the talk was given in

available_lang: All available languages (lang_code) for a talk

comments: Count of comments

duration: Duration in seconds

topics: Related tags or topics for the talk

related_talks: Related talks (key='talk_id',value='title')

url: URL of the talk

description: Description of the talk

transcript: Full transcript of the talk


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df = pd.DataFrame(dataset)
# Print the unique values of every column in turn
for column in df.columns:
    print(df[column].unique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
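# Write your code to make your dataset analysis ready.
# drop() returns a new DataFrame; `dataset` itself keeps all columns because later cells still use 'talk_id'.
dataset.drop(['talk_id','about_speakers','url','description','transcript'], axis=1)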


### What all manipulations have you done and insights you found?

I have dropped some of the columns that were not useful for finding insights.
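
One detail worth keeping in mind is that `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch, using a made-up `talks` frame and a hypothetical `analysis_ready` name:

```python
import pandas as pd

# Made-up frame with a subset of the TED columns
talks = pd.DataFrame({
    "talk_id": [1, 2],
    "url": ["u1", "u2"],
    "views": [100, 200],
})

# drop() returns a new frame; assign it to keep the result
analysis_ready = talks.drop(columns=["talk_id", "url"])

print(list(talks.columns))           # original still has every column
print(list(analysis_ready.columns))  # only the columns kept for analysis
```

Keeping the trimmed copy under a separate name leaves identifier columns such as `talk_id` available for later grouping.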

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
ax = snb.barplot(x="duration", y="speaker_1", data=dataset.sort_values('duration', ascending=False)[:20])


##### 1. Why did you pick the specific chart?

I chose this chart because many people dislike watching longer videos unless they are really engaging, so it may be useful to identify the speakers who have delivered the longest presentations.











##### 2. What is/are the insight(s) found from the chart?

The plot above indicates that Chris Anderson delivered a presentation for an extended duration.









##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can prove to be quite valuable. By analyzing this graph, we can assess the level of engagement of the audience towards a speaker and predict whether they were successful in capturing the viewers' attention. Additionally, this graph can serve as a foundation for further analysis and aid in making informed decisions regarding future events or speakers.


#### Chart - 2

In [None]:
# Chart - 2 visualization code
ax = snb.barplot(x="views", y="speaker_1", data=dataset.sort_values('views', ascending=False)[:20])


##### 1. Why did you pick the specific chart?

I selected this particular graph because I wanted to determine the viewership of longer videos and whether they receive a substantial number of views or not. The graph can help me analyze if viewers lose interest in longer videos and if it is necessary to keep the duration shorter to capture their attention.






##### 2. What is/are the insight(s) found from the chart?

In the above plot, we can observe that Chris Anderson's name is not present, but several other names from the previous plot are visible.







##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the speakers with the most views can provide valuable insight into which topics and presentation styles resonate with the audience. It is also interesting to note that some speakers, such as Bill Gates, hold the audience's attention even when their talks are long. This could be due to a variety of factors, such as the speaker's charisma, the relevance of the topic, or the way the presentation is structured. By analyzing these patterns, businesses can gain a better understanding of how to create content that is both engaging and informative for their target audience.





The two graphs above raise a question: is there any relationship between views and duration?

In [None]:
#let's see the distribution of views
snb.displot(dataset[dataset['views'] < 0.4e7]['views'])
#let's see the distribution of duration
snb.displot(dataset[dataset['duration'] < 0.4e7]['duration'])


In [None]:
snb.jointplot(x='views', y='duration', data=dataset)


The relationship between views and duration is weak: the Pearson test in Section 5 reports a correlation coefficient of about 0.09. The length of a talk is therefore not the main factor behind its popularity. Other factors, such as the speaker, topic, and style of presentation, may also play a role in determining the success of a TED Talk.





#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# finding out the top most native language used 
z=dataset['native_lang'].value_counts().head(10)
print(z)

In [None]:
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (10,10)
# Create the plot object
z.plot(kind='bar',color='green',width=0.8)
plt.title('language used',fontdict={'fontsize':20,'fontweight':'bold','fontstyle':'oblique'})
plt.ylabel('language count',fontdict={'fontsize':20,'fontweight':'normal'})
plt.xlabel('native language',fontdict={'fontsize':20,'fontweight':'normal'})


##### 1. Why did you pick the specific chart?

I chose this chart to determine which native language has been used the most.







##### 2. What is/are the insight(s) found from the chart?

This shows that English is by far the most widely used native language.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight highlights that English is a popular language due to its global reach and wide understanding. Therefore, we can use English more frequently. However, it is also important to focus on other native languages, since some people prefer to consume content in their own language. Neglecting these languages could have a slightly negative impact on growth.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
snb.jointplot(x='views', y='comments', data=dataset)


##### 1. Why did you pick the specific chart?



I chose this chart to investigate the relationship between views and comments. It appears that popular videos tend to have more comments and foster more discussion.








##### 2. What is/are the insight(s) found from the chart?

This chart shows that there is a strong relationship between views and comments.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight highlights that videos with a significant number of views and comments are likely to be popular due to their compelling content, which could be due to the topic or the speaker. Therefore, creating similar talks that are engaging and informative could have a positive impact on the business by attracting more views and encouraging audience engagement.





#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
dataset[['title', 'speaker_1','views', 'comments', 'duration']].sort_values('views', ascending=False).head(10)

##### 1. Why did you pick the specific chart?

I chose this chart to determine which talk became popular based on its number of views.








##### 2. What is/are the insight(s) found from the chart?

Based on the above chart, it is clear that Sir Ken Robinson is the most popular TED speaker by views, and his talk on the topic "Do Schools Kill Creativity?" is the one that propelled him to popularity.





##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can be useful because a talk can become popular either due to the topic or the speaker. Therefore, if we want to create engaging and popular talks, we can consider bringing in speakers who have previously given popular talks, and we can also explore topics that have previously been successful. By doing so, there is a higher chance that our future talks will resonate with our audience and become popular.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
dataset[['title', 'speaker_1','views', 'comments', 'duration']].sort_values('comments', ascending=False).head(10)

##### 1. Why did you pick the specific chart?

I selected this chart to determine which talk became popular based on its number of comments. Since comments allow viewers to discuss the topic and share their thoughts and opinions, a higher number of comments can indicate that the topic is particularly interesting and engaging. Additionally, the speaker could also be a factor in generating more comments. Therefore, by looking at the title and speaker of the popular talks with high comment counts, we may be able to identify which topics and speakers are likely to generate more audience engagement in the future.





##### 2. What is/are the insight(s) found from the chart?

According to the chart, the talk titled "Militant Atheism" by Richard Dawkins received the maximum number of comments, indicating that the topic and speaker were particularly engaging and sparked a lot of discussion among viewers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight highlights that the "Militant Atheism" talk by Richard Dawkins received a high number of comments, indicating that it was a particularly engaging and thought-provoking topic. To increase the business, TED could consider bringing in speakers who can explore similar topics that are likely to spark audience engagement and generate more comments. Additionally, they could also try pairing speakers with different topics to see if they can generate similar levels of audience engagement and comments.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
df = pd.DataFrame(dataset)
df

In [None]:
df['published_year'] = pd.DatetimeIndex(df['published_date']).year
df.head()

In [None]:
df.loc[:, 'published_year']

In [None]:
df['published_year'].unique()

In [None]:
df['published_year'].nunique()

In [None]:
# Count talks per year; sort by year so that the x and y values stay aligned
talks_per_year = df['published_year'].value_counts().sort_index()
years = list(talks_per_year.index)
print(years)
number_of_talks = list(talks_per_year)
print(number_of_talks)

In [None]:
plt.plot(years,number_of_talks,color='red', marker='o')
plt.title('Number_of_talks published v/s years',fontdict={'fontsize':20,'fontweight':'bold','fontstyle':'oblique'})
plt.xlabel('Years',fontdict={'fontsize':20,'fontweight':'normal','fontstyle':'oblique'})
plt.ylabel('Number of talks published',fontdict={'fontsize':20,'fontweight':'normal','fontstyle':'oblique'})
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I have chosen this graph to gain insights into the growth or decline of the business.





##### 2. What is/are the insight(s) found from the chart?

The graph covers the years 2006 to 2020. The number of talks published rose steadily from 2006 to 2012 and then declined. The trend reversed in 2015, and the count grew again to its maximum in 2019. The data only runs to April 2020, so the 2020 count covers a partial year and cannot be compared directly with earlier years.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The number of talks published each year can be used as an indicator of the growth or expansion of the business. A steady increase in the number of talks published could indicate that the business is growing and expanding. However, other factors, such as changes in the company's policies or resources, could also affect the number of talks published and therefore the growth of the business.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
ted_final=df[['published_year','views']]
ted_final

In [None]:
ted_final['published_year'].unique()

In [None]:
# Total views per publication year
df3=df.pivot_table(columns='published_year', values='views',aggfunc='sum')
df3

In [None]:
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (10,10)
df3.plot(kind='bar')
plt.title('views by year',fontdict={'fontsize':20,'fontweight':'bold','fontstyle':'oblique'})
plt.ylabel('views',fontdict={'fontsize':20,'fontweight':'normal'})
plt.xlabel('year',fontdict={'fontsize':20,'fontweight':'normal'})


In [None]:
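import ast
# available_lang is assumed to be read from the CSV as a string such as "['en', 'fr']"; parse it so that languages, not characters, are counted
dataset['number_of_lang'] = dataset['available_lang'].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else len(x))
snb.histplot(dataset['number_of_lang'], kde=True)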


##### 1. Why did you pick the specific chart?

This graph can provide insight into the popularity of the TED talks each year and can help us understand which years had more viewer engagement.

##### 2. What is/are the insight(s) found from the chart?

This graph shows that the total number of views of TED talks increased consistently over the years, with a sharp drop in 2020. The data only runs to April 2020, so that drop reflects an incomplete year at least in part; the global pandemic and the resulting changes in people's behavior and priorities may also have played a role. It will be interesting to see whether views rebound in the coming years.





##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 While the increasing popularity of TED talks is a positive sign for the business, it's important to continue to strive for growth and find ways to engage even more viewers. This could include exploring new topics, featuring diverse speakers, or experimenting with different formats to keep the audience engaged and interested.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
ted=dataset.groupby(['event'],as_index=False).agg({'views':'sum','talk_id':'count'}).sort_values('views',ascending=False).reset_index()[:10]
# Average views per talk for each event (after the groupby, 'talk_id' holds the number of talks)
ted['avg_views']=ted['views']/ted['talk_id']
plt.figure(figsize=(10,6))
ax=snb.barplot(x='event',y='views',data=ted)
labels=ax.get_xticklabels()
plt.title('Top TED Events by views')
# The bars show raw view counts, not millions
plt.ylabel('total views')
plt.setp(labels, rotation=50);


##### 1. Why did you pick the specific chart?

I have selected this chart to know which event was the most popular among the viewers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top ten TED events by total views. TED-Ed has the most views overall. Among the conference events, TED2015 was the most popular, followed by TED2014 and TEDGlobal 2013. Knowing which event was the most popular can help in analyzing the factors that contributed to its success and in replicating those factors in future events to grow the business.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

By analyzing the successful strategies and arrangements of the most popular event, we can apply them to other events and potentially increase their popularity as well. This can ultimately help in the growth of the business.
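
Total views favor events that simply hosted many talks, so it can help to compare events by average views per talk as well. The sketch below illustrates the difference on made-up data; the `talks` frame, event names, and view counts are all hypothetical.

```python
import pandas as pd

# Hypothetical talks: event E1 has many talks, event E2 has a single popular one
talks = pd.DataFrame({
    "event": ["E1", "E1", "E1", "E2", "E3", "E3"],
    "views": [400, 500, 600, 900, 50, 150],
})

# Total views, number of talks, and mean views per talk for each event
event_summary = (
    talks.groupby("event")["views"]
    .agg(total_views="sum", talk_count="count", mean_views="mean")
    .sort_values("mean_views", ascending=False)
)
print(event_summary)
```

In this toy example E1 leads on total views while E2 leads on views per talk, which is why the two rankings are worth reading side by side.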

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
dataset['occupations'].nunique()

In [None]:
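# Data type of the occupations column (type('occupations') would only report the type of the string literal)
dataset['occupations'].dtype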

In [None]:
# Ten most frequent occupations
a=dataset['occupations'].value_counts().head(10)
print(a)

In [None]:
data = a
df = pd.DataFrame(data)
print(df)



In [None]:
abc=pd.Series()


In [None]:
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (10,10)
# Create the plot object
a.plot(kind='bar',color='green',width=0.8)
plt.title('Top 10 occupations',fontdict={'fontsize':20,'fontweight':'bold','fontstyle':'oblique'})
plt.ylabel('Number of speakers from that occupation',fontdict={'fontsize':20,'fontweight':'normal'})
plt.xlabel('Occupations',fontdict={'fontsize':20,'fontweight':'normal'})


##### 1. Why did you pick the specific chart?

I chose this chart to see which professions are most common among the speakers and may therefore have more influence among the viewers.

##### 2. What is/are the insight(s) found from the chart?

The chart counts speakers per occupation; it does not measure views. Writers are the most common occupation among the speakers, followed by journalists, entrepreneurs, and artists. This can help in planning future events or inviting speakers from similar professions to attract more viewers. It can also give insight into which topics and skills are most popular among the viewers, so that talks can be tailored accordingly.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, inviting more writers, journalists, and entrepreneurs as speakers for TED talks could potentially attract more viewers who are interested in literary and intellectual topics. It may also help broaden the diversity of perspectives and insights presented in TED talks. However, it's important to keep in mind that the popularity of a speaker is not solely determined by their occupation, but also by the topic they are presenting and how engaging they are as a speaker.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
dataset[['title', 'speaker_1','views', 'duration','number_of_lang']].sort_values('duration', ascending=False).head(10)

##### 1. Why did you pick the specific chart?

As we have seen, longer videos tend to have fewer viewers, but some attract many viewers despite their length, so I was curious about the title, speaker, and views of these videos.

##### 2. What is/are the insight(s) found from the chart?

This insight suggests that while shorter talks tend to be more popular, there are exceptions to this rule. In the case of Chris Anderson's talk, the topic of climate change may have been particularly compelling to viewers, leading to its popularity despite its length. This insight suggests that while video length is an important consideration for TED talks, it is not the only factor in determining a talk's success. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. We could analyze the content of these videos to identify what made them so engaging and use those insights to create more content that resonates with our viewers. It is also important to note that while video length may be a factor, it is not the only factor that determines viewership. We should continue to focus on producing high-quality content that is informative, engaging, and relevant to our audience's interests.






#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
dataset.head()

In [None]:
# Columns used to compare each speaker's occupation with the talk title and topics
reliable_data=dataset[['title','speaker_1','occupations','topics']]
reliable_data

##### 1. Why did you pick the specific chart?

I chose this to assess the reliability of the occupation and the title of the speaker's presentation.

##### 2. What is/are the insight(s) found from the chart?

The speaker's occupation and the content of their presentation appear to be reliable, as the talk's title aligns with their profession.





##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This is indeed helpful, as the credibility of the speaker's occupation and their talk is crucial for TED to provide authentic and valuable content. If the speaker's expertise and the content they deliver are not reliable, TED's reputation could be at risk, and the viewers may lose trust, resulting in a negative impact on the business.




#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
df2=pd.DataFrame(dataset)
df2

In [None]:
df2['published_month'] = pd.DatetimeIndex(df2['published_date']).month
df2.head()

In [None]:
df2.loc[:, 'published_month']

In [None]:
# Count talks per month; sort by month so that the labels and counts stay aligned
talks_per_month = df2['published_month'].value_counts().sort_index()
month = list(talks_per_month.index)
print(month)
number_of_published_talks = list(talks_per_month)
print(number_of_published_talks)

In [None]:
plt.barh(month,number_of_published_talks)
 # setting label of y-axis
plt.ylabel("Month",fontdict={'fontsize':15,'fontweight':'normal','fontstyle':'oblique'})
 # setting label of x-axis
plt.xlabel("number of  published talks",fontdict={'fontsize':15,'fontweight':'normal','fontstyle':'oblique'})
plt.title("Published talks in a month",fontdict={'fontsize':15,'fontweight':'normal','fontstyle':'oblique'})
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart out of curiosity to determine which month had the highest number of talks published by TED.





##### 2. What is/are the insight(s) found from the chart?

September, which is the 9th month of the year, had the highest number of talks published by TED.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Although it may not hold significant importance, the chart displays the trend of talks published by TED on a monthly basis throughout the year.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
dataset2= dataset.copy() 

In [None]:
dataset2.columns


In [None]:
fig, ax = plt.subplots(figsize=(15,10))
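# numeric_only=True restricts the correlation to numeric columns; recent pandas versions raise an error on text columns otherwise
snb.heatmap(dataset2.corr(numeric_only=True), annot= True, cmap= "autumn",ax=ax)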


##### 1. Why did you pick the specific chart?

I picked this chart because a correlation heatmap visualizes the strength of the relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Correlation ranges from -1 to +1.
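
To turn a correlation matrix into a concrete list of findings, the variable pairs can be ranked by the absolute value of their correlation. The following is a sketch on synthetic data; the `numeric` frame and its values are assumptions made for illustration, not results from the TED dataset.

```python
import numpy as np
import pandas as pd

# Synthetic numeric columns: comments are built to track views, duration is independent
rng = np.random.default_rng(0)
views = rng.normal(size=200)
numeric = pd.DataFrame({
    "views": views,
    "comments": views * 0.8 + rng.normal(scale=0.3, size=200),
    "duration": rng.normal(size=200),
})

corr = numeric.corr()

# Keep only the upper triangle so that each pair appears once and the diagonal is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```

The first entries of `pairs` are the strongest relationships, whether positive or negative.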

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code

In [None]:
# dropna expects a boolean, not the string "True"
snb.pairplot(dataset,diag_kind="hist",dropna=True)


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

CONCLUSION-

HYPOTHETICAL STATEMENT-1 

A t-statistic of -114.288 and a p-value of 0.0 give strong evidence to reject the null hypothesis of the two-sample t-test, which is that comments and views have equal means. This shows that the two means differ; a t-test does not by itself measure a relationship between the two variables.

HYPOTHETICAL STATEMENT -2 

 A Pearson correlation coefficient of 0.0929 and a very small p-value of 7.517721887816747e-08 suggest a statistically significant but weak positive linear relationship between duration and views.

 HYPOTHETICAL STATEMENT-3 

A chi-square statistic of 435774.7662906001 and a very small p-value of 6.181054395856784e-08 suggest a statistically significant association between comments and duration. Because both columns are numeric and were cross-tabulated value by value, the contingency table is very sparse, so this result should be read with caution.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: There is no relationship between the number of comments on a talk and the number of views it receives.


Alternative hypothesis: There is a relationship between the number of comments on a talk and the number of views it receives.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value


In [None]:
# Check for missing values
print(dataset.isna().sum())

# Drop rows with missing values
dataset.dropna(inplace=True)

# After dropna() no missing values remain, so mean or median imputation is not needed here


In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the data into a pandas DataFrame

# Extract the two columns of interest
column1 = dataset['comments']
column2 = dataset['views']

# Conduct a two-sample t-test
t_stat, p_val = ttest_ind(column1, column2)

# Print the results
print("T-statistic: ", t_stat)
print("P-value: ", p_val)


##### Which statistical test have you done to obtain P-Value?

I employed a two-sample t-test to obtain the P-value.

##### Why did you choose the specific statistical test?

I employed a two-sample t-test because it can detect a significant difference between the means of two groups.
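
Since the hypothesis is about a relationship rather than a difference in means, a correlation test is a natural complement. The sketch below runs `scipy.stats.pearsonr` and `scipy.stats.spearmanr` on synthetic data; the `views` and `comments` arrays are generated for illustration and are not the TED values.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed data in which comments grow with views (illustrative only)
rng = np.random.default_rng(42)
views = rng.lognormal(mean=13, sigma=1, size=500)
comments = views * 1e-4 * rng.lognormal(mean=0, sigma=0.5, size=500)

# Pearson measures linear association; Spearman measures monotonic association on ranks
pearson_r, pearson_p = stats.pearsonr(comments, views)
spearman_rho, spearman_p = stats.spearmanr(comments, views)

print("Pearson r:", pearson_r, "p-value:", pearson_p)
print("Spearman rho:", spearman_rho, "p-value:", spearman_p)
```

A small p-value here would support the alternative hypothesis that comments and views are related, and the coefficient indicates how strong that relationship is.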

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: There is no relationship between the duration and the views of a talk.


Alternative hypothesis: There is a relationship between the duration and the views of a talk.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
import scipy.stats as stats

# Create two arrays of data
x = dataset['duration']
y = dataset['views']

# Calculate the Pearson correlation coefficient and p-value
corr_coef, p_value = stats.pearsonr(x, y)

# Print the results
print('Pearson correlation coefficient:', corr_coef)
print('P-value:', p_value)


##### Which statistical test have you done to obtain P-Value?

I used the Pearson correlation test to obtain the P-value.

##### Why did you choose the specific statistical test?

I used the Pearson correlation coefficient because it measures the strength and direction of the linear relationship between two continuous variables, and its p-value tests the null hypothesis that the true correlation is zero. The p-value returned by `stats.pearsonr` assumes that the data are approximately normally distributed, so for heavily skewed columns a rank-based measure such as Spearman's correlation is a more robust check.
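
Pearson's coefficient captures linear association, so a monotonic but curved relationship can be understated. As a hedged illustration on synthetic data (the `duration` and `views` arrays below are generated, not taken from the dataset), Spearman's rank correlation can be computed alongside it:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Hypothetical talk lengths (arbitrary units) and a right-skewed response
# that grows monotonically but non-linearly with them.
duration = rng.uniform(2, 20, size=300)
views = np.exp(0.5 * duration + rng.normal(0, 0.5, size=300))

pearson_r, pearson_p = pearsonr(duration, views)
spearman_rho, spearman_p = spearmanr(duration, views)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
```

Because Spearman works on ranks, it is unaffected by the skew and reports a stronger association than Pearson on this kind of data.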

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: There is no relationship between a talk's duration and its number of comments.


Alternative hypothesis: There is a relationship between a talk's duration and its number of comments.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
# Create a contingency table
cont_table = pd.crosstab(dataset['comments'], dataset['duration'])

# Conduct chi-square test
chi2_stat, p_val, dof, exp_freq = stats.chi2_contingency(cont_table)

# Print results
print("Chi-square statistic:", chi2_stat)
print("p-value:", p_val)


##### Which statistical test have you done to obtain P-Value?

I used the chi-square test of independence to obtain the P-value.

##### Why did you choose the specific statistical test?

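By calculating the chi-square test statistic and the associated p-value, we can determine whether an observed association between two categorical variables is likely due to chance or is statistically significant. Note that 'comments' and 'duration' are numeric, so the contingency table treats every distinct value as its own category; this produces a large, sparse table with very low expected counts, and the result should be read with caution unless the columns are binned first.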
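
One way to make the chi-square test applicable to numeric columns is to bin them into a few categories first. The sketch below uses synthetic data and an assumed choice of three quantile bins; neither the values nor the bin count come from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)

# Made-up durations (seconds) and comment counts that loosely depend on them.
toy = pd.DataFrame({"duration": rng.uniform(120, 1200, size=400)})
toy["comments"] = 100 + 0.2 * toy["duration"] + rng.normal(0, 30, size=400)

# Convert each numeric column into three equally populated categories.
labels = ["low", "mid", "high"]
toy["duration_bin"] = pd.qcut(toy["duration"], q=3, labels=labels)
toy["comments_bin"] = pd.qcut(toy["comments"], q=3, labels=labels)

# 3 x 3 contingency table of the binned variables.
cont_table = pd.crosstab(toy["comments_bin"], toy["duration_bin"])
chi2_stat, p_val, dof, expected = chi2_contingency(cont_table)

print(cont_table)
print("Chi-square statistic:", chi2_stat)
print("p-value:", p_val, "degrees of freedom:", dof)
```

With three bins per variable every cell has a reasonable expected count, which is what the chi-square approximation relies on.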

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

In [None]:
#missing values
dataset.isna().sum()


In [None]:
# missing value percentage
round((dataset.isna().sum() / len(dataset))*100,2)


In [None]:
mn.matrix(dataset,figsize=(10,8));


In [None]:
## Copy of data
df = dataset.copy()
df = df.dropna(axis=0)
df.isna().sum()

In [None]:
print('Dataset Size With Missing Values',dataset.shape)


In [None]:
print('Dataset Size Without Missing Values',df.shape)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Strictly speaking, no imputation technique is applied in this section. I computed the percentage of missing values in each column, which is a quick way to see which columns are affected and how severe the missingness is, and then dropped the rows containing missing values from a copy of the dataset (df). The percentages can also guide the choice between an imputation method and dropping a column altogether.
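
As a minimal sketch of how the missingness percentages could drive that decision, the example below drops columns that are mostly empty and fills the remaining gaps with the column median. The toy frame and the 50% cut-off (`DROP_THRESHOLD`) are assumptions made for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented.
toy = pd.DataFrame({
    "duration": [600.0, 720.0, np.nan, 540.0, 900.0],
    "comments": [120.0, np.nan, np.nan, 80.0, np.nan],
    "views": [1.2e6, 8.0e5, 2.3e6, 5.0e5, 9.5e5],
})

# Percentage of missing values per column.
missing_pct = (toy.isna().mean() * 100).round(2)

# Assumed rule of thumb: drop columns that are mostly empty, impute the rest.
DROP_THRESHOLD = 50.0  # percent
to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index
cleaned = toy.drop(columns=to_drop)

# Fill the remaining gaps with each column's median.
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))

print(missing_pct)
print(cleaned)
```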

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
snb.boxplot(x='duration', data=dataset)


In [None]:
# Replace upper-tail outliers (above Q3 + 1.5 * IQR) in 'duration' with the column mean
q1, q3 = dataset['duration'].quantile([0.25, 0.75])
duration_iqr = q3 - q1
dataset['duration'] = dataset['duration'].mask(dataset['duration'] > q3 + 1.5 * duration_iqr, dataset['duration'].mean())



In [None]:
snb.boxplot(x='duration', data=dataset)


In [None]:
snb.boxenplot(x='number_of_lang', data=dataset)


In [None]:
# Replace upper-tail outliers (above Q3 + 1.5 * IQR) in 'number_of_lang' with the column mean
q1, q3 = dataset['number_of_lang'].quantile([0.25, 0.75])
number_of_lang_iqr = q3 - q1
dataset['number_of_lang'] = dataset['number_of_lang'].mask(dataset['number_of_lang'] > q3 + 1.5 * number_of_lang_iqr, dataset['number_of_lang'].mean())



In [None]:
snb.boxenplot(x='number_of_lang', data=dataset)


In [None]:
snb.boxenplot(x='views', data=dataset)


In [None]:
# Replace upper-tail outliers (above Q3 + 1.5 * IQR) in 'views' with the column mean
q1, q3 = dataset['views'].quantile([0.25, 0.75])
views_iqr = q3 - q1
dataset['views'] = dataset['views'].mask(dataset['views'] > q3 + 1.5 * views_iqr, dataset['views'].mean())



In [None]:
snb.boxenplot(x='views', data=dataset)


In [None]:
snb.boxenplot(x='comments', data=dataset)


In [None]:
# Replace upper-tail outliers (above Q3 + 1.5 * IQR) in 'comments' with the column mean
q1, q3 = dataset['comments'].quantile([0.25, 0.75])
comments_iqr = q3 - q1
dataset['comments'] = dataset['comments'].mask(dataset['comments'] > q3 + 1.5 * comments_iqr, dataset['comments'].mean())

In [None]:
snb.boxenplot(x='comments', data=dataset)


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the IQR (interquartile range) rule, also known as Tukey's method, to identify outliers in the 'duration', 'number_of_lang', 'views', and 'comments' columns, and replaced them with the column mean.

Outliers in these columns may come from measurement errors, data entry errors, or true extreme values in the distribution. They can distort the data and hurt models that rely on assumptions of normality and homoscedasticity.

The IQR is the difference between the third quartile (75th percentile) and the first quartile (25th percentile) of a column.

Tukey's method flags any value above Q3 + 1.5 * IQR (or below Q1 - 1.5 * IQR) as an outlier. The code above applies only the upper fence.

Replacing the flagged values with the column mean keeps every row while substituting a typical value for the extreme one. Note that the mean is computed with the outliers still included, so it is itself pulled toward them.
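
The repeated per-column steps can be sketched as one small helper. The function name `replace_upper_outliers` and the toy series are hypothetical; the logic mirrors the upper-fence-and-mean replacement used in the cells above.

```python
import pandas as pd

def replace_upper_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values above Q3 + k * IQR with the mean of the original series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return series.mask(series > upper_fence, series.mean())

# Toy series with one obvious outlier (100.0).
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="duration")
cleaned = replace_upper_outliers(toy)

print(cleaned)
```

In this toy series the upper fence is 15, while the replacement value is the mean of about 26.3, which still sits above the fence because the outlier inflates the mean. Replacing with the median, or capping at the fence, would avoid that effect.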

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
!pip install category_encoders


In [None]:
df_numeric = df[['talk_id', 'views', 'comments', 'duration', 'number_of_lang']]
df_categorical = df[[ 'speaker_1','native_lang','topics',]]


In [None]:
df_numeric.head()

In [None]:
df_categorical.head()

In [None]:
print(df['speaker_1'].unique())
print(df['native_lang'].unique())
print(df['topics'].unique())

In [None]:
from sklearn.preprocessing import LabelEncoder

speaker_encoder = LabelEncoder()
speaker_encoder.fit(df_categorical['speaker_1'])
speaker_values = speaker_encoder.transform(df_categorical['speaker_1'])
print("Before Encoding:", list(df_categorical['speaker_1'][-10:]))
print("After Encoding:", speaker_values[-10:])
print("The inverse from the encoding result:", speaker_encoder.inverse_transform(speaker_values[-10:]))



In [None]:
from sklearn.preprocessing import OneHotEncoder

native_lang_encoder = OneHotEncoder()
native_lang_reshaped = np.array(df_categorical['native_lang']).reshape(-1, 1)
native_lang_values = native_lang_encoder.fit_transform(native_lang_reshaped)

print(df_categorical['native_lang'][:5])
print()
print(native_lang_values.toarray()[:5])
print()
print(native_lang_encoder.inverse_transform(native_lang_values)[:5])


In [None]:
from sklearn.preprocessing import LabelEncoder

topics_encoder = LabelEncoder()
topics_encoder.fit(df_categorical['topics'])
topics_values = topics_encoder.transform(df_categorical['topics'])
print("Before Encoding:", list(df_categorical['topics'][-10:]))
print("After Encoding:", topics_values[-10:])
print("The inverse from the encoding result:", topics_encoder.inverse_transform(topics_values[-10:]))



In [None]:
# Map each speaker to the mean views of their talks (a mean/target encoding)
speaker_avg_views = dataset.groupby('speaker_1')['views'].mean()
dataset['speaker_1_avg_views'] = dataset['speaker_1'].map(speaker_avg_views)
plt.figure(figsize=(10,5))
# histplot with kde=True replaces the deprecated distplot
snb.histplot(dataset['speaker_1_avg_views'], kde=True)



Distribution plot of the average views for each speaker in the dataset.

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used LabelEncoder to convert the categorical values of the 'speaker_1' and 'topics' columns into integer codes that can be passed to machine learning algorithms that require numeric input.

I also used one-hot encoding on 'native_lang', which creates one binary column per category and avoids implying an order between categories. Finally, 'speaker_1_avg_views' maps each speaker to the mean views of their talks.
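
Because 'speaker_1_avg_views' is derived from the target column, computing it on the full dataset lets information about the test rows leak into the feature. A minimal sketch of the same mapping learned from training rows only is shown below; the `train` and `test` frames are invented, and falling back to the overall training mean for unseen speakers is an assumed choice.

```python
import pandas as pd

# Invented training and test rows.
train = pd.DataFrame({
    "speaker_1": ["A", "A", "B", "C"],
    "views": [100, 300, 50, 400],
})
test = pd.DataFrame({"speaker_1": ["A", "B", "D"]})  # "D" never appears in train

# Learn the per-speaker mean from the training rows only.
speaker_means = train.groupby("speaker_1")["views"].mean()
global_mean = train["views"].mean()

train["speaker_1_avg_views"] = train["speaker_1"].map(speaker_means)

# Speakers unseen during training fall back to the overall training mean.
test["speaker_1_avg_views"] = test["speaker_1"].map(speaker_means).fillna(global_mean)

print(test)
```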

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
df_numeric = df[['talk_id', 'views', 'comments', 'duration', 'number_of_lang']]
df_categorical = df[[ 'title','speaker_1','all_speakers','occupations','topics']]


In [None]:
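# Expand Contraction
pd.set_option('display.max_colwidth', None)  # None shows full column text (-1 is no longer accepted)
data = df_categorical.copy()  # copy so later column assignments do not touch a slice of df
data.head()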

#### 2. Lower Casing

In [None]:
# Lower Casing
data['title'] = data['title'].str.lower()
data['speaker_1'] = data['speaker_1'].str.lower()
data['all_speakers'] = data['all_speakers'].str.lower()
data['occupations'] = data['occupations'].str.lower()
data['topics'] = data['topics'].str.lower()


In [None]:
new_data = data.assign(
    title=data['title'].str.lower(),
    speaker_1=data['speaker_1'].str.lower(),
    all_speakers=data['all_speakers'].str.lower(),
    occupations=data['occupations'].str.lower(),
    topics=data['topics'].str.lower(),
)

# display the first few rows of the new DataFrame
new_data

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# defining the function to remove punctuation from a single string
def remove_punctuation(text):
    punctuationfree = "".join([ch for ch in text if ch not in string.punctuation])
    return punctuationfree

# storing the punctuation-free text; the function is applied to every cell,
# not to each column as a whole
new_data2 = new_data.apply(lambda col: col.map(remove_punctuation))
new_data2


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

In [None]:

import re

text_columns = ['title', 'speaker_1', 'all_speakers', 'occupations', 'topics']

for col in text_columns:
    # remove URLs from the text in this column
    new_data[col] = new_data[col].apply(lambda x: re.sub(r'http\S+', '', x))
    # remove words containing digits from the text in this column
    new_data[col] = new_data[col].apply(lambda x: re.sub(r'\w*\d\w*', '', x))

# display the updated DataFrame
new_data


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
nltk.download('stopwords')
# Stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[0:10])

# defining the function to remove stopwords from tokenized text
def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output


In [None]:
# Remove White spaces

In [None]:
new_data['title'] = new_data['title'].str.strip()
new_data['speaker_1'] = new_data['speaker_1'].str.strip()
new_data['all_speakers'] = new_data['all_speakers'].str.strip()
new_data['occupations'] = new_data['occupations'].str.strip()
new_data['topics'] = new_data['topics'].str.strip()

In [None]:
new_data['title'] = new_data['title'].apply(lambda x: " ".join(x.split()))
new_data['speaker_1'] = new_data['speaker_1'].apply(lambda x: " ".join(x.split()))
new_data['all_speakers'] = new_data['all_speakers'].apply(lambda x: " ".join(x.split()))
new_data['occupations'] = new_data['occupations'].apply(lambda x: " ".join(x.split()))
new_data['topics'] = new_data['topics'].apply(lambda x: " ".join(x.split()))

In [None]:
new_data

#### 6. Rephrase Text

In [None]:
# Rephrase Text

In [None]:
nltk.download('wordnet')


#### 7. Tokenization

In [None]:
# Tokenization
import re
def tokenization(text):
    # split on runs of non-word characters (the pattern must be r'\W+', not 'W+')
    tokens = re.split(r'\W+', text)
    return tokens
#applying function to the column
new_data['title'] = new_data['title'].apply(lambda x: tokenization(x))
new_data['speaker_1'] = new_data['speaker_1'].apply(lambda x: tokenization(x))
new_data['all_speakers'] = new_data['all_speakers'].apply(lambda x: tokenization(x))
new_data['occupations'] = new_data['occupations'].apply(lambda x: tokenization(x))
new_data['topics'] = new_data['topics'].apply(lambda x: tokenization(x))



#### 8. Text Normalization

In [None]:
#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer

#defining the object for stemming
porter_stemmer = PorterStemmer()

#defining a function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text

new_data['title'] = new_data['title'].apply(lambda x: stemming(x))
new_data['speaker_1'] = new_data['speaker_1'].apply(lambda x: stemming(x))
new_data['all_speakers'] = new_data['all_speakers'].apply(lambda x: stemming(x))
new_data['occupations'] = new_data['occupations'].apply(lambda x: stemming(x))
new_data['topics'] = new_data['topics'].apply(lambda x: stemming(x))



##### Which text normalization technique have you used and why?

I used the Porter stemming algorithm from the nltk library to reduce each token to its stem, so that inflected forms of the same word are treated as a single term.

#### 9. Part of speech tagging

In [None]:
# POS Tagging


In [None]:
import nltk
import pandas as pd

# download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# create a DataFrame holding the talk titles as text
# (a separate name is used so that `df` from earlier cells is not overwritten)
pos_df = pd.DataFrame({'text': dataset['title'].astype(str)})

# define a function to perform POS tagging on a given text string
def pos_tag(text):
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

# apply the pos_tag function to the 'text' column of the DataFrame
pos_df['pos_tags'] = pos_df['text'].apply(pos_tag)

# print the first rows of the resulting DataFrame
print(pos_df.head())


#### 10. Text Vectorization

In [None]:
# Vectorizing Text

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# create a DataFrame with the talk titles as the text column
# (a separate name is used so that `df` from earlier cells is not overwritten)
text_df = pd.DataFrame({'text': dataset['title'].astype(str)})

# create the vectorizer object
# max_features=500 is an arbitrary cap that keeps the dense table below small
vectorizer = CountVectorizer(max_features=500)

# fit the vectorizer on the text column and transform it into vectors
vectors = vectorizer.fit_transform(text_df['text'])

# create a new DataFrame with the vectorized representation of the text data
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
vectorized_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())

# print the first rows of the vectorized DataFrame
print(vectorized_df.head())


##### Which text vectorization technique have you used and why?

I used CountVectorizer to convert the text into a numerical representation: a sparse matrix in which each column is a token from the vocabulary and each cell holds that token's count in a document.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features


In [None]:
dataset['recorded_date'] = pd.to_datetime(dataset['recorded_date'])
dataset['published_date'] = pd.to_datetime(dataset['published_date'])

# Calculate the duration of days taken to publish the video
dataset['published_duration_days'] = (dataset['published_date'] - dataset['recorded_date']).dt.days

# Print the first five rows of the updated dataset
print(dataset.head())

In [None]:
dataset['published_duration_days']

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting


In [None]:
dataset.columns

In [None]:
unwanted_features=['talk_id', 'title', 'speaker_1', 'all_speakers', 'occupations',
       'about_speakers', 'views', 'recorded_date', 'published_date', 'event',
       'native_lang', 'available_lang', 'topics',
       'related_talks', 'url', 'description', 'transcript',]
     

In [None]:
df4=dataset.copy()

In [None]:
print(df4.head())

In [None]:
#dropping unimportant columns from the datasets.
df4.drop(columns=unwanted_features,inplace=True)
     

In [None]:
df4.columns

In [None]:
x=df4

In [None]:
#one hot encoding on categorical features
X=pd.get_dummies(x)
X.shape

In [None]:
X.head(2)


In [None]:

#checking for null values
X[['duration','comments']].isna().sum()

In [None]:
X.head()

In [None]:
import xgboost as xgb
xgb_model= xgb.XGBRegressor(objective="reg:squarederror")

In [None]:
dataset.columns

In [None]:
X= dataset[['duration',
       'speaker_1_avg_views', 
       'number_of_lang']]

In [None]:
from sklearn.preprocessing import StandardScaler
Scaler= StandardScaler()
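
The scaler above is created but not yet fitted. A minimal sketch of how it could be used is shown below, under stated assumptions: the feature values are synthetic stand-ins for the three selected columns, and the 80/20 split ratio and `random_state` are arbitrary choices rather than the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic stand-ins for the selected features and the target.
X_toy = pd.DataFrame({
    "duration": rng.uniform(120, 1200, size=200),
    "speaker_1_avg_views": rng.uniform(1e5, 5e6, size=200),
    "number_of_lang": rng.integers(1, 60, size=200),
})
y_toy = rng.uniform(1e5, 5e6, size=200)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows only keeps test-set statistics out of the preprocessing step.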

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***