

# FORMATTING - numeric values



Let me ***scrape*** this [table](https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity/country-comparison/) from the CIA World Factbook.

In [1]:
# read data from website
linkCiaPurchPower='https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity/country-comparison/'

library(rvest)

# get tables from website
ciapupow = read_html(linkCiaPurchPower)%>%html_table()%>%.[[1]] #get first table, there is only one

# see table scraped
head(ciapupow)


Rank,Country,Var.3,Date of Information
<int>,<chr>,<chr>,<chr>
1,China,"$33,598,000,000,000",2024 est.
2,United States,"$25,676,000,000,000",2024 est.
3,India,"$14,244,000,000,000",2024 est.
4,Russia,"$6,089,000,000,000",2024 est.
5,Japan,"$5,715,000,000,000",2024 est.
6,Germany,"$5,247,000,000,000",2024 est.


The data needs cleaning before it can be formatted. Review the previous material and apply the cleaning steps.

### I.1 Cleaning before formatting

The currency symbols and commas above tell you that R has not identified the third column as numeric.

The first step is to check whether '**$**' and '**,**' are the only characters present besides digits:

In [2]:
# replace any digit by 'nothing' ('')
# and show unique values

# function
byeDigits=function(x) gsub('\\d','',x)
# check
unique(apply(ciapupow[,c(3)],1,FUN = byeDigits))
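
To make the idea behind this check concrete, here is a minimal sketch on a small vector. The values in `sample_values` are hypothetical and are not taken from the scraped table; the names `leftovers` and `symbols` are also just for illustration. Since `gsub()` is vectorised, the function can be called on the whole vector at once.

```r
# Hypothetical sample values (an assumption, not rows from the CIA table)
sample_values <- c("$33,598,000,000,000", "$7,712,000", " $1,250 ")

# Remove every digit so that only the non-digit characters remain
byeDigits <- function(x) gsub("\\d", "", x)

# gsub() is vectorised: apply it to the whole vector, then keep unique results
leftovers <- unique(byeDigits(sample_values))
print(leftovers)

# Split the leftovers into single characters to list each distinct symbol
symbols <- unique(unlist(strsplit(leftovers, "")))
print(symbols)
```

If `symbols` contains anything other than the currency sign, the comma, and spaces, the column needs more than a simple replacement.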


The check on the scraped column confirms that it only needs a simple cleaning by replacing characters. Let me first rename that column:

In [3]:
names(ciapupow)[3]='RealGDP'

Let's create a function that trims leading and trailing spaces and then removes every character that is not a digit:

In [4]:
keepDigits=function(x) gsub('\\D','',trimws(x))

ciapupow[,c(3)]=  apply(ciapupow[,c(3)],1,FUN = keepDigits)

# Notice 'RealGDP' is still character type
str(ciapupow)

tibble [221 × 4] (S3: tbl_df/tbl/data.frame)
 $ Rank               : int [1:221] 1 2 3 4 5 6 7 8 9 10 ...
 $ Country            : chr [1:221] "China" "United States" "India" "Russia" ...
 $ RealGDP            : chr [1:221] "33598000000000" "25676000000000" "14244000000000" "6089000000000" ...
 $ Date of Information: chr [1:221] "2024 est." "2024 est." "2024 est." "2024 est." ...
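
One caveat is worth a quick sketch: the pattern `\\D` removes *everything* that is not a digit, which is safe for this table but would also drop a decimal point or a minus sign. The vector `sample_values` and the helper `keepNumberChars` below are hypothetical, introduced only to illustrate the difference.

```r
# Same cleaning function as above: trim spaces, then drop every non-digit
keepDigits <- function(x) gsub("\\D", "", trimws(x))

# Hypothetical values (an assumption); the last one has a sign and a decimal
sample_values <- c(" $1,250 ", "$7,712,000", "-3.5")

cleaned <- keepDigits(sample_values)
print(cleaned)  # note what happens to the sign and decimal point of "-3.5"

# Hypothetical alternative: keep digits, the decimal point and the minus sign
keepNumberChars <- function(x) gsub("[^0-9.-]", "", trimws(x))

cleaned_alt <- keepNumberChars(sample_values)
print(cleaned_alt)
```

For whole, positive amounts such as the GDP figures here, both functions give the same result.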


In [5]:
ciapupow

Rank,Country,RealGDP,Date of Information
<int>,<chr>,<chr>,<chr>
1,China,33598000000000,2024 est.
2,United States,25676000000000,2024 est.
3,India,14244000000000,2024 est.
4,Russia,6089000000000,2024 est.
5,Japan,5715000000000,2024 est.
6,Germany,5247000000000,2024 est.
7,Brazil,4165000000000,2024 est.
8,Indonesia,4102000000000,2024 est.
9,France,3732000000000,2024 est.
10,United Kingdom,3636000000000,2024 est.


Although *RealGDP* is now clean, it is still not in the right format, because R stores it as character:

In [6]:
summary(ciapupow$RealGDP)

   Length     Class      Mode 
      221 character character 


### I.2 Formatting numbers using **as.numeric**:

This is the function that will solve our problem.

You should only apply this function when you are sure the column is clean.

Let's see:


In [7]:
ciapupow$RealGDP=as.numeric(ciapupow$RealGDP)

#Notice 'RealGDP' is numeric now
str(ciapupow)

tibble [221 × 4] (S3: tbl_df/tbl/data.frame)
 $ Rank               : int [1:221] 1 2 3 4 5 6 7 8 9 10 ...
 $ Country            : chr [1:221] "China" "United States" "India" "Russia" ...
 $ RealGDP            : num [1:221] 3.36e+13 2.57e+13 1.42e+13 6.09e+12 5.72e+12 ...
 $ Date of Information: chr [1:221] "2024 est." "2024 est." "2024 est." "2024 est." ...
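
A short aside on why `as.numeric` fits this column better than `as.integer`: R integers are 32-bit, so values in the trillions are outside their range. The sketch below uses one value in the same form as the cleaned column; the variable names are hypothetical.

```r
# A cleaned value of the same form as the RealGDP column
gdp_text <- "33598000000000"

# Largest value an R integer can hold
print(.Machine$integer.max)

# as.integer() cannot represent the value and returns NA (with a warning,
# silenced here so the sketch runs quietly)
as_int <- suppressWarnings(as.integer(gdp_text))
print(is.na(as_int))

# as.numeric() stores the value as a double, which can hold it
as_num <- as.numeric(gdp_text)
print(as_num)
```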


In [8]:
summary(ciapupow$RealGDP)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
7.712e+06 9.308e+09 7.125e+10 7.802e+11 3.809e+11 3.360e+13 

If a value isn’t clean, `as.numeric` coerces it to `NA` and issues a warning. Explicitly handling missing values during cleaning is preferable to relying on automatic coercion.
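
As a minimal sketch of what handling this explicitly could look like, the code below converts a small hypothetical vector and then lists the entries that failed to convert, instead of letting the `NA` values pass unnoticed. The vector `raw_values` and the names `converted` and `failed` are assumptions for illustration only.

```r
# Hypothetical values (an assumption): one clean, three that are not
raw_values <- c("1250", "7,712,000", "n/a", "")

# Convert; the coercion warning is silenced because failures are checked below
converted <- suppressWarnings(as.numeric(raw_values))
print(converted)

# An entry failed if it became NA although the original was not missing
failed <- is.na(converted) & !is.na(raw_values)

# Inspect the offending original values before deciding how to clean them
print(raw_values[failed])
```

Looking at `raw_values[failed]` tells you whether more cleaning is needed or whether those entries are truly missing.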