# Data Wrangling

**Welcome to the Data Wrangling Notebook!**

This notebook shows how Python can be used to wrangle data scraped from the internet. We will use the `pandas` and `matplotlib` libraries throughout the session and exercises.

**Note: This is not a definitive guide.**

## What is Data Wrangling?
Data wrangling is the process of cleaning, organizing, structuring, and enriching raw data so that it is easier to analyze and visualize. It involves converting and mapping data from one "raw" form into another to make it ready for downstream analytics, which includes removing errors and combining complex data sets. Data wrangling is becoming increasingly necessary as the amount of data and the number of data sources grow. The wrangling process encompasses all the practices used to ensure that data is high quality and useful for analytics.
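
As a rough illustration of the cleaning steps described above, the sketch below strips stray whitespace, converts text to numbers, and drops a duplicate row. The "messy" formatting is invented for this example; only the region names and per-capita values are taken from the table used later in this notebook.

```python
import pandas as pd

# Hypothetical raw records: the stray spaces, string-typed numbers, and the
# duplicated row are invented for illustration only.
raw = pd.DataFrame({
    "region": [" Cordillera", "Cordillera", "Mimaropa "],
    "grdp_per_capita": ["179752", "179752", "120240"],
})

clean = (
    raw.assign(
        region=raw["region"].str.strip(),                       # remove stray whitespace
        grdp_per_capita=pd.to_numeric(raw["grdp_per_capita"]),  # text -> numbers
    )
    .drop_duplicates()         # the two Cordillera rows are now identical
    .reset_index(drop=True)    # renumber the remaining rows from 0
)

print(clean)
```

Real data sets usually need more steps than this, and which ones depends on the data and the domain.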

### Domain Driven Data Wrangling
Data wrangling is just one of the tools for working with data. It is equally important to let your domain expertise drive your wrangling and analysis. As the saying goes, **the workflow or tool must not contradict the science of the domain.** The tools and workflows we create or deliver must be guided by the principles of the industry in which they will be used.

## Prerequisites

Before we start, you will need a basic knowledge of the following technologies:

- [Python](https://www.python.org/)

## Primary Tool
The primary libraries that we will be using are `pandas` and `matplotlib`. These libraries allow us to work with and visualize structured data. Python also supports a number of other data wrangling tools and libraries.

## Hands-on
For this tutorial, we will be using the data that we scraped in the **web scraping** session. We will be creating a **static map** containing the GRDP (gross regional domestic product) data.



### Import Libraries
We will import the `pandas` and `matplotlib` libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Reading Data
Pandas has a `read_csv` function that reads CSV files and returns the data as a `DataFrame`.

In [2]:
economics_data = pd.read_csv("data/wiki_economics.csv")
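
If the CSV file is not at hand, the same call can be tried on in-memory text. This is a minimal sketch: `csv_text` and `sample` are names made up for this example, and the text stands in for `data/wiki_economics.csv` using two rows and three columns from this notebook's table.

```python
import io

import pandas as pd

# Stand-in for data/wiki_economics.csv: two rows copied from this notebook's
# table, with only three of the columns kept for brevity.
csv_text = """Region,"GRDP(PHP, thousands)",GRDP per capita(PHP)
Metro Manila,6309290637,462779
Cordillera,322093866,179752
"""

# read_csv accepts a file path or a file-like object and returns a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # DataFrame
print(sample.shape)           # (2, 3)
print(sample.dtypes)          # the two numeric columns are parsed as integers
```

Note that the quoted header `"GRDP(PHP, thousands)"` is read as a single column name even though it contains a comma.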

### DataFrames: a way to process data
A DataFrame is a 2-dimensional data structure that consists of rows and columns. *You can think of it as an Excel spreadsheet.*

In [3]:
economics_data

,Region,"GRDP(PHP, thousands)","Agriculture(PHP, thousands)","Industry(PHP, thousands)","Services(PHP, thousands)",GRDP per capita(PHP)
0,Metro Manila,6309290637,442597,1230125141,5078722899,462779
1,Cordillera,322093866,27045337,77990725,217057804,179752
2,Ilocos Region,629772047,104471256,192218332,333082459,120512
3,Cagayan Valley,397625523,103563850,115614177,178447496,109851
4,Central Luzon,2177046900,231995441,950969430,994082029,179840
5,Calabarzon,2861724791,154312287,1445358775,1262053729,181781
6,Mimaropa,377014287,64116478,125427469,187470340,120240
7,Bicol Region,560314934,85820150,202529524,271965260,92288
8,Western Visayas,916379059,144256702,194479931,577642425,116946
9,Central Visayas,1266701029,79478668,342195668,845026693,161289
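
A few attributes help when inspecting a table like the one above. The sketch below rebuilds three rows and two columns of it by hand so that it runs without the CSV file; `sample` is a name made up for this example, and the real `economics_data` has more rows and columns.

```python
import pandas as pd

# Three rows and two columns copied from the table above.
sample = pd.DataFrame({
    "Region": ["Metro Manila", "Cordillera", "Ilocos Region"],
    "GRDP per capita(PHP)": [462779, 179752, 120512],
})

print(sample.shape)          # (rows, columns) -> (3, 2)
print(list(sample.columns))  # the column labels
print(sample.head(2))        # the first two rows
print(sample["GRDP per capita(PHP)"].max())  # 462779
```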


#### Accessing Data in Dataframes

To access data in a DataFrame, we can use the following indexers:

- `iloc` - selects rows by integer position, counting from 0
- `loc` - selects rows by index label

In [7]:
ex_df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
ex_df

,age,color,food,height,score,state
Jane,30,blue,Steak,165,4.6,NY
Nick,2,green,Lamb,70,8.3,TX
Aaron,12,red,Mango,120,9.0,FL
Penelope,4,white,Apple,80,3.3,AL
Dean,32,gray,Cheese,180,1.8,AK
Christina,33,black,Melon,172,9.5,TX
Cornelia,69,red,Beans,150,2.2,TX


In [12]:
ex_df.loc["Jane"]

# now try retrieving "Aaron"

age          30
color      blue
food      Steak
height      165
score       4.6
state        NY
Name: Jane, dtype: object

In [9]:
ex_df.iloc[0]

# now try retrieving "Aaron" using iloc

age          30
color      blue
food      Steak
height      165
score       4.6
state        NY
Name: Jane, dtype: object

Back in our main example, `economics_data` supports `iloc` and `loc` in the same way.
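
The sketch below tries both indexers on a hand-built subset of the GRDP table, so it does not depend on the CSV file. The name `sample` and the choice to use `Region` as the index are assumptions made for this example, not steps taken from the original notebook.

```python
import pandas as pd

# Three rows and two columns copied from the economics table.
sample = pd.DataFrame({
    "Region": ["Metro Manila", "Cordillera", "Ilocos Region"],
    "GRDP per capita(PHP)": [462779, 179752, 120512],
})

# With the default integer index, position 0 and label 0 are the same row.
first_by_position = sample.iloc[0]
first_by_label = sample.loc[0]

# Assumption: with Region as the index, loc looks rows up by region name.
by_region = sample.set_index("Region")
cordillera = by_region.loc["Cordillera", "GRDP per capita(PHP)"]

# iloc still works by position: second row, first column.
same_value = by_region.iloc[1, 0]

print(cordillera, same_value)  # 179752 179752
```

With the default index the two indexers look interchangeable, but they stop agreeing as soon as the index labels are no longer the integers 0, 1, 2, and so on.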