# Data Cleaning - YouTube Channel Dataset

### Import the pandas library.

In [1]:
import pandas as pd

### Read the "top-5000-youtube-channels.csv" file and store it in a variable.

In [2]:
df = pd.read_csv('top-5000-youtube-channels.csv')

### Display the first five rows of the DataFrame.

In [3]:
df.head()

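|   | Rank | Grade | Channel name | Video Uploads | Subscribers | Video views |
|---|------|-------|--------------|---------------|-------------|-------------|
| 0 | 1st | A++ | Zee TV | 82757 | 18752951 | 20869786591 |
| 1 | 2nd | A++ | T-Series | 12661 | 61196302 | 47548839843 |
| 2 | 3rd | A++ | Cocomelon - Nursery Rhymes | 373 | 19238251 | 9793305082 |
| 3 | 4th | A++ | SET India | 27323 | 31180559 | 22675948293 |
| 4 | 5th | A++ | WWE | 36756 | 32852346 | 26273668433 |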


### Find the number of non-null records in each column and each column's datatype.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           5000 non-null   object
 1   Grade          5000 non-null   object
 2   Channel name   5000 non-null   object
 3   Video Uploads  5000 non-null   object
 4   Subscribers    5000 non-null   object
 5   Video views    5000 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


### Convert the datatype of column **Subscribers** to `int`. If there are any errors, try to solve the problem.

In [5]:
# df["Subscribers"].astype("int")
# ValueError: invalid literal for int() with base 10: '-- '

In [6]:
# "--" marks a missing count; substitute "0" so the column can be cast to int.
df["Subscribers"] = df["Subscribers"].str.replace("--", "0", regex=False)
df["Subscribers"] = df["Subscribers"].astype("int")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           5000 non-null   object
 1   Grade          5000 non-null   object
 2   Channel name   5000 non-null   object
 3   Video Uploads  5000 non-null   object
 4   Subscribers    5000 non-null   int64 
 5   Video views    5000 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 234.5+ KB
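
Before replacing a placeholder, it can help to see exactly which values block the conversion. The following is a minimal sketch on a small made-up Series (the sample values are assumptions, not rows from the dataset). It uses `pd.to_numeric` with `errors="coerce"` to flag unparseable entries, and shows one alternative to substituting 0: keeping the missing counts as missing with the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical sample: numeric strings plus a '-- ' placeholder.
sample = pd.Series(["18752951", "61196302", "-- ", "31180559"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(sample, errors="coerce")

# The entries that failed to parse are the ones to inspect or clean.
bad_values = sample[parsed.isna()].unique()
print(bad_values)

# Alternative to replacing with 0: keep missing counts as <NA>
# by using the nullable integer dtype.
subscribers = parsed.astype("Int64")
print(subscribers)
```

Whether 0 or a missing value is the better choice depends on the later analysis: a 0 will pull averages down, while `<NA>` is skipped by most pandas aggregations.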


### Convert the datatype of column **Video Uploads** to `int`. If there are any errors, try to solve the problem.

In [7]:
# df["Video Uploads"].astype("int")
# ValueError: invalid literal for int() with base 10: '--'

In [8]:
# Same placeholder as in Subscribers: replace "--" with "0", then cast.
df["Video Uploads"] = df["Video Uploads"].str.replace("--", "0", regex=False)
df["Video Uploads"] = df["Video Uploads"].astype("int")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           5000 non-null   object
 1   Grade          5000 non-null   object
 2   Channel name   5000 non-null   object
 3   Video Uploads  5000 non-null   int64 
 4   Subscribers    5000 non-null   int64 
 5   Video views    5000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 234.5+ KB


### Convert the datatype of column **Rank** to `int`. If there are any errors, try to solve the problem.

In [9]:
# df["Rank"].astype("int")
# ValueError: invalid literal for int() with base 10: '1st'

In [10]:
# Drop the two-character ordinal suffix ('st', 'nd', 'rd', 'th') and any thousands separators.
df['Rank'] = df['Rank'].str[0:-2].str.replace(',','', regex=False)
df['Rank'] = df['Rank'].astype('int')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           5000 non-null   int64 
 1   Grade          5000 non-null   object
 2   Channel name   5000 non-null   object
 3   Video Uploads  5000 non-null   int64 
 4   Subscribers    5000 non-null   int64 
 5   Video views    5000 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 234.5+ KB
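
Slicing off the last two characters relies on every rank ending in a two-letter suffix. As a hedged alternative, the sketch below removes every non-digit character with a regular expression, which handles the suffix and the thousands separator in one step. The rank strings are hypothetical examples in the same style as the column, not values copied from the file.

```python
import pandas as pd

# Hypothetical rank strings in the same style as the dataset.
ranks = pd.Series(["1st", "2nd", "3rd", "4th", "1,000th", "4,999th"])

# \D matches any non-digit, so suffix letters and commas are both removed.
rank_numbers = ranks.str.replace(r"\D", "", regex=True).astype("int64")
print(rank_numbers.tolist())
```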


### Which unique values are there in the **Grade** column?

In [11]:
df['Grade'].unique()

array(['A++ ', 'A+ ', 'A ', '\xa0 ', 'A- ', 'B+ '], dtype=object)

### Drop the rows that have abnormal values in the **Grade** column and check the unique values again.

In [12]:
# '\xa0' is a non-breaking space, so these rows carry no grade.
indices = df[df["Grade"] == '\xa0 '].index
df.drop(index=indices, inplace=True)
df['Grade'].unique()

array(['A++ ', 'A+ ', 'A ', 'A- ', 'B+ '], dtype=object)
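
Every label printed above carries a trailing space, which is easy to overlook when comparing or mapping values later. One possible way to normalize them is sketched below on a made-up Series that only reuses the label shapes shown by `unique()`; it is an illustration, not the notebook's own approach. `str.strip()` removes surrounding whitespace, including the non-breaking space, so blank grades become empty strings.

```python
import pandas as pd

# Hypothetical sample using the label shapes printed by unique() above.
grades = pd.Series(["A++ ", "A+ ", "\xa0 ", "A- ", "B+ "])

# str.strip() removes surrounding whitespace, including '\xa0'.
cleaned = grades.str.strip()

# Blank labels are now empty strings and can be filtered out directly.
valid = cleaned[cleaned != ""]
print(valid.tolist())
```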

### Replace the values of the **Grade** column with appropriate integer values.

In [13]:
# Every key keeps the trailing space found in the data; a key without it would map to NaN.
channel_map = {'A++ ': 5, 'A+ ': 4, 'A ': 3, 'A- ': 2, 'B+ ': 1}
df['Grade'] = df['Grade'].map(channel_map)
df['Grade'].unique()

array([5, 4, 3, 2, 1])

### Display the datatypes of columns.

In [14]:
df.dtypes

Rank              int64
Grade             int64
Channel name     object
Video Uploads     int64
Subscribers       int64
Video views       int64
dtype: object

### Display the first five rows of the DataFrame again.

In [15]:
df.head()

|   | Rank | Grade | Channel name | Video Uploads | Subscribers | Video views |
|---|------|-------|--------------|---------------|-------------|-------------|
| 0 | 1 | 5 | Zee TV | 82757 | 18752951 | 20869786591 |
| 1 | 2 | 5 | T-Series | 12661 | 61196302 | 47548839843 |
| 2 | 3 | 5 | Cocomelon - Nursery Rhymes | 373 | 19238251 | 9793305082 |
| 3 | 4 | 5 | SET India | 27323 | 31180559 | 22675948293 |
| 4 | 5 | 5 | WWE | 36756 | 32852346 | 26273668433 |
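
As a closing sanity check on the grade mapping: `Series.map` with a dictionary does not raise for labels that have no entry, it returns NaN for them and turns the column into `float64`. A minimal sketch for spotting such labels follows. The Series and the deliberately incomplete `grade_map` are hypothetical and exist only to trigger the case.

```python
import pandas as pd

# Hypothetical labels; 'B+ ' is deliberately missing from the mapping below.
grades = pd.Series(["A++ ", "A+ ", "A ", "A- ", "B+ "])
grade_map = {"A++ ": 5, "A+ ": 4, "A ": 3, "A- ": 2}

mapped = grades.map(grade_map)

# Labels without a dictionary entry come back as NaN.
unmapped = grades[mapped.isna()].unique()
print(unmapped)
```

An empty result means every label was covered by the dictionary.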
