To clean this dataset, let's import the Python libraries we'll be using:

In [1]:
import pandas as pd
import pandasql as ps
import numpy



Next, we'll read in the second CSV file. We'll call this dataframe 'df_newstats' since it contains the data on the most recent trending YouTubers:

In [2]:
df_newstats = pd.read_csv('datadump/Youtubers.csv')

df_newstats.head()

Unnamed: 0,Rank,Channel Name,Category,Subscribers,Country,Average Views,Average Likes,Average Comments,Content Type
0,,,,,,,,,
1,1.0,T-Series,Music & Dance,258.4M,India,135.2K,5.6K,223,
2,2.0,MrBeast,Video games,236.1M,United States,104M,4M,74K,Humor
3,3.0,Cocomelon - Nursery Rhymes,Education,171.4M,United States,5.1M,57.1K,0,
4,4.0,SET India,,167.1M,India,27.9K,996,7,


I can already see some null values that we'll have to take care of, but let's get an overview of the number of columns and rows before cleaning the data:

In [3]:
rows, cols = df_newstats.shape
print("Number of columns before cleaning data: ", cols)
print("Number of rows before cleaning data: ", rows)

Number of columns before cleaning data:  9
Number of rows before cleaning data:  1046
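
To put a number on the nulls spotted in the preview, a per-column count can help. This is only a minimal sketch on a small stand-in frame: `sample` is made up for illustration from a few values shown above and is not read from `Youtubers.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of df_newstats
sample = pd.DataFrame({
    "Rank": [None, 1.0, 2.0],
    "Channel Name": [None, "T-Series", "MrBeast"],
    "Content Type": [None, None, "Humor"],
})

# Number of missing values in each column
null_counts = sample.isna().sum()
print(null_counts)

# Rows where every column is missing (like the blank first row)
all_null_rows = sample[sample.isna().all(axis=1)]
print(len(all_null_rows))
```

Running the same two calls on `df_newstats` would show which columns are sparse enough to drop and whether the blank first row is the only fully empty one.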


In [4]:
#start the data cleaning by deleting the first row, which contains only null values
df_newstats.drop(df_newstats.index[0], axis=0, inplace=True)

#then reduce our data to the top 50 YouTubers. Slicing from position 50 onward drops the rest
#without hard-coding the row count (1045 after dropping that first null row):
df_newstats.drop(df_newstats.index[50:], axis=0, inplace=True)

#convert the decimals in the Rank column to integers,
#reset the default index after removing the first row,
#then set the dataframe options to display ALL cols and rows.
df_newstats['Rank'] = df_newstats['Rank'].astype(int)
df_newstats.reset_index(inplace = True, drop = True)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df_newstats.head(None)

Unnamed: 0,Rank,Channel Name,Category,Subscribers,Country,Average Views,Average Likes,Average Comments,Content Type
0,1,T-Series,Music & Dance,258.4M,India,135.2K,5.6K,223,
1,2,MrBeast,Video games,236.1M,United States,104M,4M,74K,Humor
2,3,Cocomelon - Nursery Rhymes,Education,171.4M,United States,5.1M,57.1K,0,
3,4,SET India,,167.1M,India,27.9K,996,7,
4,5,✿ Kids Diana Show,Animation,118.5M,,5.1M,14.3K,0,Toys
5,6,Like Nastya,Toys,112.6M,,3.4M,45.9K,0,
6,7,PewDiePie,Movies,111.6M,United States,1.7M,109.2K,3.1K,Video games
7,8,Vlad and Niki,Animation,109.6M,,5.8M,42.7K,0,Toys
8,9,Zee Music Company,Music & Dance,104.5M,India,38K,1.6K,26,
9,10,WWE,Video games,99M,United States,182.9K,6.8K,227,
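
The same cleaning steps can also be written as one chained expression that returns a new frame instead of modifying it in place. This is a sketch under assumptions, not the notebook's method: `raw` is a made-up stand-in frame, `top_n` is a hypothetical name, and `dropna(how="all")` removes every fully empty row, which matches the cell above only if the first row is the only such row.

```python
import pandas as pd

# Hypothetical stand-in: one blank row followed by ranked channels
raw = pd.DataFrame({
    "Rank": [None, 1.0, 2.0, 3.0],
    "Channel Name": [None, "T-Series", "MrBeast", "Cocomelon - Nursery Rhymes"],
})

top_n = 2  # the notebook keeps 50; 2 keeps this example small

cleaned = (
    raw.dropna(how="all")        # remove rows where every value is missing
       .head(top_n)              # keep the first top_n remaining rows
       .astype({"Rank": int})    # 1.0 -> 1
       .reset_index(drop=True)   # restart the index at 0
)
print(cleaned)
```

Because nothing is changed in place, the original frame stays available for re-running the cell with a different `top_n`.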


In [5]:
#now let's drop the repeated columns we won't need when combining both tables to make comparisons
df_newstats = df_newstats.drop(columns = ['Category', 'Country', 'Average Views', 'Content Type'])

#let's display the column names again
df_newstats.columns.values

array(['Rank', 'Channel Name', 'Subscribers', 'Average Likes',
       'Average Comments'], dtype=object)

In [6]:
#after comparing our values to the old dataset and doing some research, I realized BLACKPINK, Justin Bieber, Ed Sheeran, and Taylor Swift were missing by mistake;
#as a Swiftie would say, "I don't know about you, but I'm feeling data, too!"
#or as a sheeraner would say, "I'm in love with the shape of data!"
#or as a BLACKPINK fan would say, "Let's kill this dataset!"
#or as a Belieber would say, "Is it too late now to save data?"
#now that I've had my fun, let's add them back to the dataset!
df_newstats.loc[len(df_newstats)] = {'Rank': 12, 'Channel Name' : 'BLACKPINK', 'Subscribers' : '93.2M', 'Average Likes' : '439.2k', 'Average Comments' : '16.7k'}
df_newstats.loc[len(df_newstats)] = {'Rank': 17, 'Channel Name' : 'Justin Bieber', 'Subscribers' : '72.6M', 'Average Likes' : '211.2k', 'Average Comments' : '7.4k'}
df_newstats.loc[len(df_newstats)] = {'Rank': 36, 'Channel Name' : 'Ed Sheeran', 'Subscribers' : '54.3M', 'Average Likes' : '15.5k', 'Average Comments' : '364'}
df_newstats.loc[len(df_newstats)] = {'Rank': 34, 'Channel Name' : 'Taylor Swift', 'Subscribers' : '56.6M', 'Average Likes' : '188.5k', 'Average Comments' : '6k'}
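
As an alternative to assigning one row at a time with `.loc`, several rows can be collected in a list and appended in a single step with `pd.concat`. The sketch below uses a small made-up frame (`base`) with only three of the columns; the names `base`, `missing`, and `combined` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame
base = pd.DataFrame({
    "Rank": [1, 2],
    "Channel Name": ["T-Series", "MrBeast"],
    "Subscribers": ["258.4M", "236.1M"],
})

# Rows to add, written once as a list of dicts
missing = [
    {"Rank": 12, "Channel Name": "BLACKPINK", "Subscribers": "93.2M"},
    {"Rank": 17, "Channel Name": "Justin Bieber", "Subscribers": "72.6M"},
]

# ignore_index=True renumbers the combined frame from 0
combined = pd.concat([base, pd.DataFrame(missing)], ignore_index=True)
print(combined)
```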

In [7]:
df_newstats.head(None)

Unnamed: 0,Rank,Channel Name,Subscribers,Average Likes,Average Comments
0,1,T-Series,258.4M,5.6K,223
1,2,MrBeast,236.1M,4M,74K
2,3,Cocomelon - Nursery Rhymes,171.4M,57.1K,0
3,4,SET India,167.1M,996,7
4,5,✿ Kids Diana Show,118.5M,14.3K,0
5,6,Like Nastya,112.6M,45.9K,0
6,7,PewDiePie,111.6M,109.2K,3.1K
7,8,Vlad and Niki,109.6M,42.7K,0
8,9,Zee Music Company,104.5M,1.6K,26
9,10,WWE,99M,6.8K,227
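
Note that the subscriber, like, and comment values are strings such as '258.4M' or '5.6K', so sorting them directly would compare text rather than numbers. One possible approach, sketched here under assumptions, is to parse the suffix into a number and sort on that. `parse_count` and `MULTIPLIERS` are hypothetical helpers, not something the notebook defines, and the frame below is a made-up sample.

```python
import pandas as pd

# Hypothetical helper: turn "258.4M", "5.6K", or "996" into a float
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_count(text):
    text = str(text).strip().upper()  # handles both "74K" and "16.7k"
    if text and text[-1] in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[text[-1]]
    return float(text)

sample = pd.DataFrame({
    "Channel Name": ["BLACKPINK", "T-Series", "MrBeast"],
    "Subscribers": ["93.2M", "258.4M", "236.1M"],
})

# Sort on the parsed value while keeping the original strings in the column
ranked = sample.sort_values(
    "Subscribers", key=lambda col: col.map(parse_count), ascending=False
).reset_index(drop=True)
print(ranked)
```

Upper-casing the text first matters here because the manually added rows use a lowercase 'k' while the original data uses 'K'.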


In [8]:
#now we'll need to sort based on subscriber count to get them in the right order again
#Also rename channel name column to youtuber
