# What is Web Scraping?
Web scraping is the process of automatically extracting content and data from a website.

# Scraping HTML Tables with Pandas 

In [3]:
import pandas as pd
import numpy as np

### If all you're interested in is a few tables from a web page, you don't need to set up a whole scraper, because pandas can do it for you. The `pandas.read_html()` function is useful for quickly pulling tables from a website without working out how to scrape the site's HTML yourself.

In [4]:
tiktok = pd.read_html("https://en.wikipedia.org/wiki/List_of_most-followed_TikTok_accounts")
# Pass the URL of the page; read_html returns a list of DataFrames, one per table it finds

### Now select the DataFrame you want from the list of tables. Since it is the first table on this page, it is at index 0.

In [5]:
tokDF = tiktok[0]
tokDF

Unnamed: 0,Rank,Username,Owner,Followers[8](millions),Likes[8](millions),Description,Country,Brand Account
0,1,@charlidamelio,Charli D'Amelio,113.9,9200,Dancer and social media personality,United States,—
1,2,@addisonre,Addison Rae,79.9,5100,Dancer and social media personality,United States,—
2,3,@bellapoarch,Bella Poarch,63.8,1400,Social media personality,United States,—
3,4,@zachking,Zach King,58.9,724,Filmmaker and social media personality,United States,—
4,5,@tiktok,TikTok,53,250,Social media platform,United States,
5,6,@spencerx,Spencer Polanco Knight,52.7,1300,Beatboxer and social media personality,United States,—
6,7,@willsmith,Will Smith,52.6,315,Actor,United States,—
7,8,@lorengray,Loren Gray,52.1,2800,"Singer, dancer, and social media personality",United States,—
8,9,@dixiedamelio,Dixie D'Amelio,51.3,2900,Singer and social media personality,United States,—
9,10,@justmaiko,Michael Le,47.4,1300,Dancer and social media personality,United States,—
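When a page holds many tables, picking the right index by trial and error gets tedious. As a minimal sketch, the `match` argument of `pd.read_html()` keeps only the tables that contain a given piece of text. The HTML below is a made-up stand-in for a real page (the usernames and numbers are hypothetical), and it is parsed from a string so the example runs offline. Note that `read_html` needs an HTML parser such as `lxml` installed.

```python
import io
import pandas as pd

# Hypothetical page with two tables; the account names and numbers are made up.
html = """
<table>
  <tr><th>Rank</th><th>Username</th><th>Followers (millions)</th></tr>
  <tr><td>1</td><td>@example_one</td><td>10.5</td></tr>
  <tr><td>2</td><td>@example_two</td><td>7.25</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Event</th></tr>
  <tr><td>2020</td><td>Placeholder</td></tr>
</table>
"""

# Without `match`, every table on the page is returned.
all_tables = pd.read_html(io.StringIO(html))

# With `match`, only tables containing the given text are kept.
matching = pd.read_html(io.StringIO(html), match="Followers")
followers_table = matching[0]

print(len(all_tables), len(matching))
print(followers_table)
```

The same `match` argument can be passed alongside a URL, which may save you from scanning the whole list for the table you want.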


### Check how many rows and columns there are in `tokDF`

In [6]:
tokDF.shape

(51, 8)

In [8]:
tokDF.describe()

Unnamed: 0,Rank,Username,Owner,Followers[8](millions),Likes[8](millions),Description,Country,Brand Account
count,51,51,51,51.0,51,51,51,47
unique,51,51,51,46.0,39,23,13,2
top,33,@flighthouse,Chase Hudson,36.2,1100,Social media personality,United States,—
freq,1,1,1,2.0,4,21,31,46


### What are the column names? Print out the answer in the format "The column names are: X, Y, and Z." 

In [5]:
columnList = tokDF.columns

In [6]:
print("The column names are: ", end= "")
for i in range(len(columnList)):
    if i < len(columnList)-1:
        print(columnList[i], end = ", ")
    else:
        print("and", columnList[i], end = ".")

The column names are: Rank, Username, Owner, Followers[8](millions), Likes[8](millions), Description, Country, and Brand Account.
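The loop above works, but the same sentence can also be built with `str.join`. This is only a sketch of one alternative; `format_names` is a hypothetical helper name, and the short column list stands in for `tokDF.columns`.

```python
def format_names(names):
    """Join names as 'A, B, and C.' with a comma before the final 'and'."""
    names = [str(name) for name in names]
    if not names:
        return ""
    if len(names) == 1:
        return names[0] + "."
    if len(names) == 2:
        return f"{names[0]} and {names[1]}."
    # Everything except the last name is comma-separated.
    return ", ".join(names[:-1]) + ", and " + names[-1] + "."


columns = ["Rank", "Username", "Owner"]  # stand-in for tokDF.columns
sentence = "The column names are: " + format_names(columns)
print(sentence)
```

Building the string first and printing it once also makes the result easy to reuse or test.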

### Rename the columns so that the `[8]` is removed from the names.

In [7]:
tokDF.columns = ['Rank', 'Username', 'Owner', 'Followers (millions)',
       'Likes (millions)', 'Description', 'Country', 'Brand Account']
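Typing out every column name is fine for eight columns, but the footnote number on Wikipedia can change between page revisions. A hedged alternative is to strip any bracketed number with a regular expression. The empty DataFrame below is a stand-in that only mimics the scraped column names.

```python
import pandas as pd

# Stand-in frame with the same kind of footnote markers as the scraped table.
df = pd.DataFrame(columns=["Rank", "Followers[8](millions)", "Likes[8](millions)"])

# Replace any "[digits]" marker with a single space.
df.columns = df.columns.str.replace(r"\[\d+\]", " ", regex=True)

print(list(df.columns))
```

Because the pattern matches any run of digits in brackets, it keeps working if the marker becomes `[9]` or `[12]` later.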

### Check the last row of the dataframe.

In [8]:
tokDF.tail(1)

Unnamed: 0,Rank,Username,Owner,Followers (millions),Likes (millions),Description,Country,Brand Account
50,"As of April 26, 2021","As of April 26, 2021","As of April 26, 2021","As of April 26, 2021","As of April 26, 2021","As of April 26, 2021","As of April 26, 2021","As of April 26, 2021"


### Drop the last row from the dataframe.

In [9]:
tokDF = tokDF[:-1]
tokDF

Unnamed: 0,Rank,Username,Owner,Followers (millions),Likes (millions),Description,Country,Brand Account
0,1,@charlidamelio,Charli D'Amelio,113.9,9200,Dancer and social media personality,United States,—
1,2,@addisonre,Addison Rae,79.9,5100,Dancer and social media personality,United States,—
2,3,@bellapoarch,Bella Poarch,63.8,1400,Social media personality,United States,—
3,4,@zachking,Zach King,58.9,724,Filmmaker and social media personality,United States,—
4,5,@tiktok,TikTok,53.0,250,Social media platform,United States,
5,6,@spencerx,Spencer Polanco Knight,52.7,1300,Beatboxer and social media personality,United States,—
6,7,@willsmith,Will Smith,52.6,315,Actor,United States,—
7,8,@lorengray,Loren Gray,52.1,2800,"Singer, dancer, and social media personality",United States,—
8,9,@dixiedamelio,Dixie D'Amelio,51.3,2900,Singer and social media personality,United States,—
9,10,@justmaiko,Michael Le,47.4,1300,Dancer and social media personality,United States,—
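Slicing with `[:-1]` assumes the footer is always exactly the last row. As a sketch under that caveat, the footer can instead be detected by checking which `Rank` values are not numbers. The small frame below is made up to mirror the shape of the problem.

```python
import pandas as pd

# Made-up rows: two data rows plus a footer repeated across every column.
df = pd.DataFrame({
    "Rank": ["1", "2", "As of April 26, 2021"],
    "Username": ["@example_one", "@example_two", "As of April 26, 2021"],
})

# errors="coerce" turns non-numeric text into NaN instead of raising.
is_data_row = pd.to_numeric(df["Rank"], errors="coerce").notna()

# .copy() gives an independent frame that is safe to modify later.
clean = df.loc[is_data_row].copy()
print(clean)
```

This approach also removes note rows that appear in the middle of a table, which a positional slice would miss.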


In [10]:
tokDF.isnull().sum()

Rank                    0
Username                0
Owner                   0
Followers (millions)    0
Likes (millions)        0
Description             0
Country                 0
Brand Account           4
dtype: int64

### What is the average follower count (in millions)? If it doesn't work at first, check the data types, convert the column with `pd.to_numeric()`, and try again.

In [11]:
# Check data types
tokDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rank                  50 non-null     object
 1   Username              50 non-null     object
 2   Owner                 50 non-null     object
 3   Followers (millions)  50 non-null     object
 4   Likes (millions)      50 non-null     object
 5   Description           50 non-null     object
 6   Country               50 non-null     object
 7   Brand Account         46 non-null     object
dtypes: object(8)
memory usage: 3.2+ KB


In [12]:
# Change column's data type and check to make sure it's changed 
tokDF['Followers (millions)'] = pd.to_numeric(tokDF['Followers (millions)'])
tokDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Rank                  50 non-null     object 
 1   Username              50 non-null     object 
 2   Owner                 50 non-null     object 
 3   Followers (millions)  50 non-null     float64
 4   Likes (millions)      50 non-null     object 
 5   Description           50 non-null     object 
 6   Country               50 non-null     object 
 7   Brand Account         46 non-null     object 
dtypes: float64(1), object(7)
memory usage: 3.2+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokDF['Followers (millions)'] = pd.to_numeric(tokDF['Followers (millions)'])
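The warning above appears because `tokDF` was created by slicing another DataFrame, so pandas cannot tell whether the assignment is meant to change the slice or the original. One common way to avoid it, sketched here on made-up data, is to take an explicit copy at the moment of slicing. Whether the warning is emitted at all may depend on your pandas version.

```python
import pandas as pd

# Made-up stand-in for the scraped table, including the footer row.
raw = pd.DataFrame(
    {"Followers (millions)": ["113.9", "79.9", "As of April 26, 2021"]}
)

# Copy right after slicing so later assignments target an independent frame.
trimmed = raw[:-1].copy()
trimmed["Followers (millions)"] = pd.to_numeric(trimmed["Followers (millions)"])

print(trimmed.dtypes)
```

The original `raw` frame is left untouched, so it can still be inspected if the cleaning step needs to be revisited.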


### Now find the average follower count (in millions), rounded to three decimal places.

In [13]:
round(tokDF['Followers (millions)'].mean(),3)

37.068

### Drop the last column of `tokDF`

In [14]:
tokDF.drop('Brand Account', axis='columns', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


### Find the average follower count (in millions) for each country, to the nearest whole number. Remember that `.astype()` can be used on a Series to convert every value it holds.

In [15]:
tokDF['Followers (millions)'] = tokDF['Followers (millions)'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokDF['Followers (millions)'] = tokDF['Followers (millions)'].astype(float)


In [16]:
tokdf2 = tokDF.groupby('Country')['Followers (millions)'].agg(['mean'])

In [17]:
tokdf2.astype(int)

Country,mean
Aruba,32
Canada,26
Colombia,25
Germany,29
India,31
Italy,26
Japan,29
Mexico,33
South Korea,32
Spain,25
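One thing to keep in mind: `.astype(int)` drops the fractional part rather than rounding, so a mean of 32.7 becomes 32. If the nearest whole number is what you want, calling `.round()` first is a small change. The values below are made up purely to show the difference.

```python
import pandas as pd

# Made-up per-country means.
means = pd.Series({"Country A": 32.7, "Country B": 26.2}, name="mean")

truncated = means.astype(int)        # fractional part is discarded
rounded = means.round().astype(int)  # rounded first, then converted

print(truncated.tolist())
print(rounded.tolist())
```

Applied to the notebook, that would look like `tokdf2.round().astype(int)`, which may give different numbers from the truncated output shown above.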


### Make a dictionary containing usernames as keys and followers (in millions) as the values. Use `.to_dict()`

In [20]:
userDF = tokDF.set_index('Username')
userDict = userDF['Followers (millions)'].to_dict()
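A dictionary can hold each key only once, so if a username appeared twice, the later row would silently overwrite the earlier one. A minimal sketch of checking for that before converting, using made-up accounts:

```python
import pandas as pd

# Made-up accounts standing in for tokDF.
df = pd.DataFrame({
    "Username": ["@example_one", "@example_two"],
    "Followers (millions)": [10.5, 7.25],
})

# Confirm the keys are unique before turning them into a dictionary.
usernames_unique = df["Username"].is_unique

followers_by_user = df.set_index("Username")["Followers (millions)"].to_dict()
print(usernames_unique, followers_by_user)
```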

### Read in `countries of the world.csv` as `countryDF`

In [21]:
countryDF = pd.read_csv("countries of the world.csv")
countryDF

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,000,2306,16307,700.0,360,32,1213,022,8765,1,466,2034,038,024,038
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,0232,0188,0579
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,004,-039,31,6000.0,700,781,322,025,9653,1,1714,461,0101,06,0298
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,000,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,West Bank,NEAR EAST,2460492,5860,4199,000,298,1962,800.0,,1452,169,1897,6413,3,3167,392,009,028,063
223,Western Sahara,NORTHERN AFRICA,273008,266000,10,042,,,,,,002,0,9998,1,,,,,04
224,Yemen,NEAR EAST,21456188,527970,406,036,0,615,800.0,502,372,278,024,9698,1,4289,83,0135,0472,0393
225,Zambia,SUB-SAHARAN AFRICA,11502010,752614,153,000,0,8829,800.0,806,82,708,003,929,2,41,1993,022,029,0489


In [22]:
countryDF.columns

Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service'],
      dtype='object')

### Check how many nulls there are for each column

In [23]:
countryDF.isnull().sum()

Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          3
Infant mortality (per 1000 births)     3
GDP ($ per capita)                     1
Literacy (%)                          18
Phones (per 1000)                      4
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
Climate                               22
Birthrate                              3
Deathrate                              4
Agriculture                           15
Industry                              16
Service                               15
dtype: int64

### Check which countries have null values for `Literacy`

In [24]:
countryDF[countryDF['Literacy (%)'].isnull()]

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
25,Bosnia & Herzegovina,EASTERN EUROPE,4498976,51129,880,4,31.0,2105.0,6100.0,,2154.0,136.0,296.0,8344.0,4.0,877.0,827.0,142.0,308.0,55.0
66,Faroe Islands,WESTERN EUROPE,47246,1399,338,7984,141.0,624.0,22000.0,,5038.0,214.0,0.0,9786.0,,1405.0,87.0,27.0,11.0,62.0
74,Gaza Strip,NEAR EAST,1428757,360,39688,1111,16.0,2293.0,600.0,,2443.0,2895.0,2105.0,50.0,3.0,3945.0,38.0,3.0,283.0,687.0
78,Gibraltar,WESTERN EUROPE,27928,7,39897,17143,0.0,513.0,17500.0,,8777.0,0.0,0.0,100.0,,1074.0,931.0,,,
80,Greenland,NORTHERN AMERICA,56361,2166086,0,204,-837.0,1582.0,20000.0,,4489.0,0.0,0.0,100.0,1.0,1593.0,784.0,,,
85,Guernsey,WESTERN EUROPE,65409,78,8386,6410,384.0,471.0,20000.0,,8424.0,,,,3.0,881.0,1001.0,3.0,1.0,87.0
99,Isle of Man,WESTERN EUROPE,75441,572,1319,2797,536.0,593.0,21000.0,,6760.0,9.0,0.0,91.0,3.0,1105.0,1119.0,1.0,13.0,86.0
104,Jersey,WESTERN EUROPE,91084,116,7852,6034,276.0,524.0,24800.0,,8113.0,0.0,0.0,100.0,3.0,93.0,928.0,5.0,2.0,93.0
108,Kiribati,OCEANIA,105432,811,1300,14094,0.0,4852.0,800.0,,427.0,274.0,5068.0,4658.0,2.0,3065.0,826.0,89.0,242.0,668.0
123,Macedonia,EASTERN EUROPE,2050554,25333,809,0,-145.0,1009.0,6700.0,,2600.0,2226.0,181.0,7593.0,3.0,1202.0,877.0,118.0,319.0,563.0
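If only the country names are needed rather than every column, `.loc` can filter the rows and pick the column in one step. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for countryDF.
df = pd.DataFrame({
    "Country": ["Exampleland", "Samplestan", "Testonia"],
    "Literacy (%)": [97.0, np.nan, np.nan],
})

# Boolean mask selects the rows, the column label selects the values.
missing_literacy = df.loc[df["Literacy (%)"].isna(), "Country"].tolist()
print(missing_literacy)
```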


### Create a subset of `countryDF` which contains only `Country`, `Population`, and `Literacy` data

In [26]:
sub_df = countryDF[["Country", "Population", "Literacy (%)"]]
sub_df

Unnamed: 0,Country,Population,Literacy (%)
0,Afghanistan,31056997,360
1,Albania,3581655,865
2,Algeria,32930091,700
3,American Samoa,57794,970
4,Andorra,71201,1000
...,...,...,...
222,West Bank,2460492,
223,Western Sahara,273008,
224,Yemen,21456188,502
225,Zambia,11502010,806


### Use pandas' `merge()` to combine the TikTok data with the subset of the country data you made. 

In [27]:
merger = tokDF.merge(sub_df, how="left", on="Country")
merger

Unnamed: 0,Rank,Username,Owner,Followers (millions),Likes (millions),Description,Country,Population,Literacy (%)
0,1,@charlidamelio,Charli D'Amelio,113.9,9200,Dancer and social media personality,United States,,
1,2,@addisonre,Addison Rae,79.9,5100,Dancer and social media personality,United States,,
2,3,@bellapoarch,Bella Poarch,63.8,1400,Social media personality,United States,,
3,4,@zachking,Zach King,58.9,724,Filmmaker and social media personality,United States,,
4,5,@tiktok,TikTok,53.0,250,Social media platform,United States,,
5,6,@spencerx,Spencer Polanco Knight,52.7,1300,Beatboxer and social media personality,United States,,
6,7,@willsmith,Will Smith,52.6,315,Actor,United States,,
7,8,@lorengray,Loren Gray,52.1,2800,"Singer, dancer, and social media personality",United States,,
8,9,@dixiedamelio,Dixie D'Amelio,51.3,2900,Singer and social media personality,United States,,
9,10,@justmaiko,Michael Le,47.4,1300,Dancer and social media personality,United States,,


### Why might you be getting `NaN`s in the merged dataframe? Use `.tolist()` on the Country values to see a list of all the countries.

In [28]:
sub_df['Country'].tolist()

['Afghanistan ',
 'Albania ',
 'Algeria ',
 'American Samoa ',
 'Andorra ',
 'Angola ',
 'Anguilla ',
 'Antigua & Barbuda ',
 'Argentina ',
 'Armenia ',
 'Aruba ',
 'Australia ',
 'Austria ',
 'Azerbaijan ',
 'Bahamas, The ',
 'Bahrain ',
 'Bangladesh ',
 'Barbados ',
 'Belarus ',
 'Belgium ',
 'Belize ',
 'Benin ',
 'Bermuda ',
 'Bhutan ',
 'Bolivia ',
 'Bosnia & Herzegovina ',
 'Botswana ',
 'Brazil ',
 'British Virgin Is. ',
 'Brunei ',
 'Bulgaria ',
 'Burkina Faso ',
 'Burma ',
 'Burundi ',
 'Cambodia ',
 'Cameroon ',
 'Canada ',
 'Cape Verde ',
 'Cayman Islands ',
 'Central African Rep. ',
 'Chad ',
 'Chile ',
 'China ',
 'Colombia ',
 'Comoros ',
 'Congo, Dem. Rep. ',
 'Congo, Repub. of the ',
 'Cook Islands ',
 'Costa Rica ',
 "Cote d'Ivoire ",
 'Croatia ',
 'Cuba ',
 'Cyprus ',
 'Czech Republic ',
 'Denmark ',
 'Djibouti ',
 'Dominica ',
 'Dominican Republic ',
 'East Timor ',
 'Ecuador ',
 'Egypt ',
 'El Salvador ',
 'Equatorial Guinea ',
 'Eritrea ',
 'Estonia ',
 'Ethiopia '

### What's causing the problem? Change the dataframe so the `Country` values match the format of the TikTok data.

In [29]:
sub_df['Country'] = sub_df['Country'].str.rstrip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df['Country'] = sub_df['Country'].str.rstrip()
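To confirm that trailing whitespace is the cause, `merge()` accepts `indicator=True`, which adds a `_merge` column saying whether each row found a partner. The sketch below uses made-up frames and placeholder populations. It also uses `.assign()`, which returns a new frame and so avoids the copy warning shown above.

```python
import pandas as pd

# Made-up frames: note the trailing space in the right-hand Country values.
left = pd.DataFrame({
    "Username": ["@example_one", "@example_two"],
    "Country": ["United States", "Canada"],
})
right = pd.DataFrame({
    "Country": ["United States ", "Canada "],
    "Population": [100, 200],  # placeholder numbers
})

# Before cleaning: no keys match, so every row is marked 'left_only'.
before = left.merge(right, how="left", on="Country", indicator=True)

# .assign returns a new DataFrame with the cleaned column.
right_clean = right.assign(Country=right["Country"].str.strip())
after = left.merge(right_clean, how="left", on="Country", indicator=True)

print(before["_merge"].tolist())
print(after["_merge"].tolist())
```

Rows still marked `left_only` after cleaning would point to a different mismatch, such as alternative spellings of a country name.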


In [30]:
sub_df['Country'].tolist()

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua & Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia & Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Is.',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Central African Rep.',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Repub. of the',
 'Cook Islands',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'East Timor',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guian

In [31]:
merger2 = tokDF.merge(sub_df, how="left", on="Country")
merger2

Unnamed: 0,Rank,Username,Owner,Followers (millions),Likes (millions),Description,Country,Population,Literacy (%)
0,1,@charlidamelio,Charli D'Amelio,113.9,9200,Dancer and social media personality,United States,298444200.0,970.0
1,2,@addisonre,Addison Rae,79.9,5100,Dancer and social media personality,United States,298444200.0,970.0
2,3,@bellapoarch,Bella Poarch,63.8,1400,Social media personality,United States,298444200.0,970.0
3,4,@zachking,Zach King,58.9,724,Filmmaker and social media personality,United States,298444200.0,970.0
4,5,@tiktok,TikTok,53.0,250,Social media platform,United States,298444200.0,970.0
5,6,@spencerx,Spencer Polanco Knight,52.7,1300,Beatboxer and social media personality,United States,298444200.0,970.0
6,7,@willsmith,Will Smith,52.6,315,Actor,United States,298444200.0,970.0
7,8,@lorengray,Loren Gray,52.1,2800,"Singer, dancer, and social media personality",United States,298444200.0,970.0
8,9,@dixiedamelio,Dixie D'Amelio,51.3,2900,Singer and social media personality,United States,298444200.0,970.0
9,10,@justmaiko,Michael Le,47.4,1300,Dancer and social media personality,United States,298444200.0,970.0


In [32]:
merger2.dtypes

Rank                     object
Username                 object
Owner                    object
Followers (millions)    float64
Likes (millions)         object
Description              object
Country                  object
Population              float64
Literacy (%)             object
dtype: object
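`Literacy (%)` is still `object`, which suggests its values were read as text. One possible cause, and this is an assumption worth checking against the raw file, is that the CSV writes decimals with a comma (for example `97,0`). If so, the `decimal` argument of `pd.read_csv()` handles it at load time. The CSV text below is made up to illustrate.

```python
import io
import pandas as pd

# Made-up CSV using a comma as the decimal separator inside quoted fields.
csv_text = (
    "Country,Population,Literacy (%)\n"
    '"Exampleland ",1000,"97,0"\n'
    '"Samplestan ",2000,"36,5"\n'
)

# decimal="," tells the parser to treat the comma as a decimal point.
df = pd.read_csv(io.StringIO(csv_text), decimal=",")

print(df.dtypes)
print(df["Literacy (%)"].tolist())
```

With the column stored as a float, numeric operations such as `.mean()` work on it without further conversion.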