# China's Annual GDP Per Province (Additional Data)
Gathering, cleaning and exporting additional data for the **China's Annual GDP Per Province** project.

In [7]:
import pandas as pd
import wikipedia as wp
import re
from datetime import datetime
import time

## China's Statistical Regions
Scraping, cleaning and exporting region categorisation of China's provinces from [Wikipedia](https://en.wikipedia.org/wiki/List_of_regions_of_China).

### Loading & Inspecting

In [2]:
china_regions_page = wp.page('List of regions of China')
html = china_regions_page.html().encode('UTF-8')

# read_html returns every table on the page; the regions table is at index 3
china_regions = pd.read_html(html)[3]

china_regions.head(3)

Unnamed: 0,Region,Map,Area,Population(2010),PopulationDensity,Provinces/Region,Provincial/Regional Seat
0,North Chinaåå (HuÃ¡bÄi),,"1,556,061 km2",164823226,105/km2,Beijing,Dongcheng District
1,North Chinaåå (HuÃ¡bÄi),,"1,556,061 km2",164823226,105/km2,Tianjin,Heping District
2,North Chinaåå (HuÃ¡bÄi),,"1,556,061 km2",164823226,105/km2,Hebei,Shijiazhuang


In [3]:
china_regions['Region'].unique()

array(['North Chinaå\x8d\x8eå\x8c\x97 (HuÃ¡bÄ\x9bi)',
       'Northeast Chinaä¸\x9cå\x8c\x97 (DÅ\x8dngbÄ\x9bi)',
       'East Chinaå\x8d\x8eä¸\x9c (HuÃ¡dÅ\x8dng)',
       'South Central Chinaä¸\xadå\x8d\x97 (ZhÅ\x8dngnÃ¡n)',
       'Southwest Chinaè¥¿å\x8d\x97 (XÄ«nÃ¡n)',
       'Northwest Chinaè¥¿å\x8c\x97 (XÄ«bÄ\x9bi)'], dtype=object)
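The positional index `[3]` used when reading the table depends on the page layout at the time of scraping, so it may point at a different table if the page changes. As a minimal sketch of one alternative, the table could be selected by its column names instead. The helper `find_table_with_columns` and the stand-in `tables` list below are hypothetical and only illustrate the idea; in the notebook the list would come from `pd.read_html(html)`.

```python
import pandas as pd


def find_table_with_columns(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table contains columns {sorted(required)}")


# Stand-in tables for demonstration (hypothetical data, not scraped)
tables = [
    pd.DataFrame({'Year': [2010], 'Note': ['x']}),
    pd.DataFrame({'Region': ['North China'], 'Provinces/Region': ['Beijing']}),
]

china_regions_demo = find_table_with_columns(tables, ['Region', 'Provinces/Region'])
```

Matching on column names fails loudly with a `ValueError` when no table fits, rather than silently returning the wrong table.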

### Cleaning
Cleaning the dataset by:
1. Dropping all columns except `Region` and `Provinces/Region`.
2. Changing column names to snake_case.
3. Cleaning `Region` column values. They are garbled because the Chinese characters and pinyin in the original page were decoded with the wrong encoding, so only the English region name is kept.

In [4]:
# .copy() gives an independent DataFrame, so the assignments below
# do not raise SettingWithCopyWarning
regions_cleaned = china_regions[['Region', 'Provinces/Region']].copy()
regions_cleaned.columns = ['region', 'province']

# capture every word up to and including 'China', e.g. 'South Central China'
regions_cleaned['region'] = regions_cleaned['region']\
                            .str.extract(r'^((?:\w+\s+)+China)', expand=False, flags=re.IGNORECASE)

regions_cleaned.region.unique()

array(['North China', 'Northeast China', 'East China',
       'South Central China', 'Southwest China', 'Northwest China'],
      dtype=object)

### Exporting
Exporting the cleaned table to a timestamped .csv file in the `input/` folder.

In [11]:
date = time.strftime('%y%m%d%H%M')

regions_cleaned.to_csv(f'input/china_regions_{date}.csv', index=False)
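`to_csv` does not create missing folders, so the export above relies on `input/` already existing. A minimal sketch of a more defensive export follows. The helper name `export_with_timestamp` is hypothetical, and the demo writes to a temporary directory rather than the project folder.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def export_with_timestamp(df, directory, stem):
    """Write df to <directory>/<stem>_<yymmddHHMM>.csv and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    stamp = time.strftime('%y%m%d%H%M')         # same format as the cell above
    out_path = out_dir / f'{stem}_{stamp}.csv'
    df.to_csv(out_path, index=False)
    return out_path


# Hypothetical one-row frame standing in for regions_cleaned
demo = pd.DataFrame({'region': ['North China'], 'province': ['Beijing']})

with tempfile.TemporaryDirectory() as tmp:
    written = export_with_timestamp(demo, Path(tmp) / 'input', 'china_regions')
    round_trip = pd.read_csv(written)
```

Returning the path makes it easy to log or reload the exact file that was written.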