<a href="https://www.kaggle.com/code/errich/the-analysis-of-the-global-youtube-subscribers?scriptVersionId=142905085" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Analyzing the Global YouTube Statistics dataset
Welcome to the captivating realm of YouTube stardom, where this meticulously curated dataset unveils the statistics of the most subscribed YouTube channels. A collection of YouTube giants, this dataset offers a perfect avenue to analyze and gain valuable insights from the luminaries of the platform. With comprehensive details on top creators' subscriber counts, video views, upload frequency, country of origin, earnings, and more, this treasure trove of information is a must-explore for aspiring content creators, data enthusiasts, and anyone intrigued by the ever-evolving online content landscape. Immerse yourself in the world of YouTube success and unlock a wealth of knowledge with this extraordinary dataset.

# Imports for this notebook
 #### pandas for data analysis
 #### Learn about pandas: https://pandas.pydata.org/docs/  
 #### To begin working with pandas, import the pandas Python package as shown below. The most common alias for pandas is pd. We also import Plotly for data visualization.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as pxg




# Reading the data 
Use read_csv() with the path to a CSV file to read comma-separated values into a DataFrame. \
You can also use read_excel() to import an Excel file. 

In [None]:
df = pd.read_csv("/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv", encoding="latin-1")
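
The `encoding` argument matters when a file contains non-ASCII characters that are not stored as UTF-8. The sketch below does not touch the Kaggle file; it builds a tiny CSV in memory (the rows and the name `toy` are invented for illustration) and reads it with the same call pattern.

```python
import io
import pandas as pd

# Toy CSV bytes encoded as Latin-1; the rows are invented for illustration.
raw = "Youtuber,Country,subscribers\nJos\u00e9 Channel,Spain,1200\nExample Channel,India,3400\n".encode("latin-1")

# read_csv accepts a file-like object as well as a path.
toy = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(toy)
```

Reading the same bytes with the default UTF-8 decoding would likely fail on the accented character, which is one plausible reason for passing `encoding="latin-1"` above.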

# Viewing and understanding DataFrames using pandas 
In this section, we will view a sample and a summary of our data. 
.head() shows the first few rows and .tail() shows the last few.
We can also use .info(), .describe(), and .shape for more detailed information. 

### View the first few or last few rows. We can pass an argument n to view n rows; by default n is 5. 

In [None]:
df.head()

In [None]:
df.tail()
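
As a small sketch of the `n` argument, the example below uses a made-up ten-row frame (the name `toy` and its values are assumptions for illustration, not the dataset).

```python
import pandas as pd

# Toy frame with ten rows; the values are invented for illustration.
toy = pd.DataFrame({"rank": range(1, 11), "subscribers": range(100, 0, -10)})

first_three = toy.head(3)  # rows at positions 0, 1, 2
last_two = toy.tail(2)     # rows at positions 8, 9
print(first_three)
print(last_two)
```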

### .shape can be used to get the number of rows and columns.  

In [None]:
df.shape

### The .info() method is a quick way to look at the data types, non-null counts, and memory usage. 

In [None]:
df.info()

### The .describe() method returns summary statistics for all numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles.

In [None]:
df.describe()
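
By default `.describe()` skips text columns. A minimal sketch on invented data (the frame `toy` is an assumption for illustration) shows how `include="all"` adds them to the summary.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["India", "India", "Brazil"],
    "subscribers": [100, 300, 50],
})

# include="all" summarizes text columns (unique, top, freq) alongside numeric ones.
summary = toy.describe(include="all")
print(summary)
```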

## Get columns and column names

In [None]:
df.columns

 ### Checking for missing values in pandas with .isnull(). To count the missing values in every column, apply .sum() to the result of .isnull(). 

In [None]:
df.isnull().sum()
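
Because `.isnull()` returns booleans, `.mean()` gives the fraction of missing values in the same way that `.sum()` gives the count. A minimal sketch on invented data (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries; the values are invented for illustration.
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Country": ["India", None, "Brazil", None],
    "subscribers": [10.0, 20.0, np.nan, 40.0],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column
print(missing_counts)
print(missing_share)
```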

# Slicing and Extracting Data in pandas

### [ ] is used to get a single column. [[ ]] is used to get more than one column 

In [None]:
df['Youtuber']

In [None]:
df[['Youtuber', 'Country']]

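### Using .loc[] and .iloc[] to fetch rows and columns. .loc[] uses labels to point to a row, column, or cell, whereas .iloc[] uses integer positions. A .loc[] slice includes its end label, while an .iloc[] slice excludes its end position, so with the default integer index the two cells below return the same eight rows.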

In [None]:
df.loc[0:7, 'Youtuber']

In [None]:
df.iloc[0:8, 1]
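
The difference between label-based and position-based slicing is easiest to see on a small frame. This is a sketch on invented data (the frame `toy` is an assumption for illustration) with the default integer index.

```python
import pandas as pd

# Toy frame with the default RangeIndex 0..9; values invented for illustration.
toy = pd.DataFrame({
    "rank": range(1, 11),
    "Youtuber": [f"channel_{i}" for i in range(10)],
})

by_label = toy.loc[0:7, "Youtuber"]   # labels 0 through 7, end label included
by_position = toy.iloc[0:8, 1]        # positions 0 through 7, end position excluded

# The same slice bounds give different row counts with .loc and .iloc.
rows_loc = len(toy.loc[0:3])    # 4 rows: labels 0, 1, 2, 3
rows_iloc = len(toy.iloc[0:3])  # 3 rows: positions 0, 1, 2
print(by_label.equals(by_position), rows_loc, rows_iloc)
```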

### Conditional slicing (selecting rows that meet certain conditions)

In [None]:
df[df.Youtuber == "T-Series"]

#### The example below fetches all YouTube channels with more than 100 million subscribers. 

In [None]:
df[df.subscribers > 100000000]
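
Conditions can also be combined. The sketch below uses invented data (the frame `toy` is an assumption for illustration) and bracket-style column access, which also works for column names that are not valid Python identifiers.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Country": ["India", "United States", "India", "Brazil"],
    "subscribers": [150_000_000, 90_000_000, 120_000_000, 200_000_000],
})

# Wrap each condition in parentheses; & means "and", | means "or".
big_india = toy[(toy["subscribers"] > 100_000_000) & (toy["Country"] == "India")]

# isin() tests membership in a list of values.
selected = toy[toy["Country"].isin(["India", "Brazil"])]
print(big_india)
print(selected)
```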

# Cleaning data using pandas 


### As a reminder, we can display the number of missing values per column using the code below.

In [None]:
df.isnull().sum()

### We can use dropna() to remove rows with null values, or fillna() to replace them with other values.

In [None]:
df1 = df.dropna()
print("Shape of our dataframe before dropping null values", df.shape)
print("Shape of our dataframe after dropping null values", df1.shape)
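
Dropping every row with a missing value can discard a lot of data. As a hedged sketch of the alternatives, the example below fills gaps per column and restricts `dropna` to one column. The frame `toy`, the `"Unknown"` placeholder, and the choice of the median are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries; the values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["India", None, "Brazil", "Japan"],
    "subscribers": [10.0, 20.0, np.nan, 30.0],
})

# Fill each column with its own replacement value.
filled = toy.fillna({
    "Country": "Unknown",
    "subscribers": toy["subscribers"].median(),
})

# Drop rows only when a specific column is missing.
dropped_subset = toy.dropna(subset=["Country"])
dropped_any = toy.dropna()
print(filled)
```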

# Data analysis in pandas


In [None]:
df.mean(numeric_only=True)

In [None]:
df.median(numeric_only=True)

### .value_counts() counts the number of occurrences of each value in a column. For example, we can count how often each country occurs. 

In [None]:
df["Country"].value_counts()
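
When shares are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on an invented series (the name `countries` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy series; the values are invented for illustration.
countries = pd.Series(["India", "United States", "India", "Brazil"], name="Country")

counts = countries.value_counts()                # occurrences, most frequent first
shares = countries.value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares)
```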

### Aggregating data with .groupby() in pandas


In [None]:
df.groupby("Country").mean(numeric_only=True)
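
A group-wise mean of every numeric column is often more than needed. As a sketch, `.agg()` with named aggregations computes several chosen statistics at once; the frame `toy` and the output column names below are invented for illustration.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["India", "India", "Brazil"],
    "subscribers": [100, 300, 50],
})

# Each keyword is an output column: (input column, aggregation).
summary = toy.groupby("Country").agg(
    channels=("subscribers", "size"),
    mean_subscribers=("subscribers", "mean"),
    max_subscribers=("subscribers", "max"),
)
print(summary)
```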

# Data visualization in pandas


### Entertainment is the leading category, followed by Music. 

In [None]:
df["category"].value_counts().plot.bar()

### The bar/scatter plot below shows the number of YouTubers by country. The United States has the largest share, with more than 300 YouTubers. A drop-down menu lets us switch between the two plot types.

In [None]:
data = df["Country"].value_counts()

In [None]:
plot = pxg.Figure(data=[pxg.Scatter(
    x=data.index,
    y=data.values,
    mode='markers',)
])
 
# Add dropdown
plot.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["type", "scatter"],
                    label="Scatter Plot",
                    method="restyle"
                ),
                dict(
                    args=["type", "bar"],
                    label="Bar Chart",
                    method="restyle"
                )
            ]),
        ),
    ]
)
 
plot.show()

### Bar plot of the number of channels created per year. 

In [None]:
df["created_year"].value_counts().plot.bar()
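
`value_counts()` orders its result by frequency, so a chart built directly from it is not in chronological order. A minimal sketch with invented years (the series `years` is an assumption for illustration) shows how `.sort_index()` restores the year order before plotting.

```python
import pandas as pd

# Toy series of creation years; the values are invented for illustration.
years = pd.Series([2006, 2012, 2006, 2010, 2012, 2012], name="created_year")

by_count = years.value_counts()              # most frequent year first
by_year = years.value_counts().sort_index()  # ascending year order
print(by_count)
print(by_year)
```

Calling `.plot.bar()` on `by_year` instead of `by_count` would then draw the bars from the earliest year to the latest.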

In [None]:
fig = px.scatter_geo(df,lat='Latitude',lon='Longitude')
fig.show()
