In [2]:
import pandas as pd
import requests
import altair as alt

[![Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/EconomicsObservatory/courses/HEAD?labpath=5%2Fs5_transforming_data.ipynb)

<a href="https://colab.research.google.com/github/EconomicsObservatory/courses/blob/main/5/s5_transforming_data.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transforming Data

The bread and butter of a data workflow is cleaning and preparation, taking raw datasets and transforming them into a useful form.

Today we're going to produce a chart using three series from the Economics Observatory API, which we'll transform with `Pandas` and present with `Vega-lite`. The chart will show Indian exports and imports as a percentage of GDP.


## Introducing Tools: Pandas

The first tool we'll use today is `Pandas`, a Python library used to work with datasets. It provides access to `DataFrames` - tables we analyse with code.

Python already has a few built-in data structures, for example lists and dictionaries:


In [2]:
london = {
    "name": "London",
    "population": 8308369,
    "area": 1572
} # This is an example of a dictionary

locations = [
    {
        "name": "London",
        "population": 8_982_000,
        "area": 606
    },
    {
        "name": "Newport",
        "population": 128_060,
        "area": 32.52
    },
    {
        "name": "Darlington",
        "population": 93_015,
        "area": 7.62
    },

]


We can turn this list into a Pandas `DataFrame`:

In [3]:
df = pd.DataFrame(locations)
df

Unnamed: 0,name,population,area
0,London,8982000,606.0
1,Newport,128060,32.52
2,Darlington,93015,7.62


We can then manipulate the `DataFrame` in different ways.

For example, we can add a density column:

In [4]:
df['density'] = df['population'] / df['area']
df

Unnamed: 0,name,population,area,density
0,London,8982000,606.0,14821.782178
1,Newport,128060,32.52,3937.884379
2,Darlington,93015,7.62,12206.692913


We can also sort our `DataFrame`:

In [5]:
sorted_df = df.sort_values(by="density", ascending=False)
sorted_df

Unnamed: 0,name,population,area,density
0,London,8982000,606.0,14821.782178
2,Darlington,93015,7.62,12206.692913
1,Newport,128060,32.52,3937.884379


# Practical: Transforming data with `Pandas`

In [13]:
df = pd.read_excel("https://github.com/jhellingsdata/RADataHub/raw/refs/heads/main/misc/consumertrendsq22024cpsa1.xlsx", sheet_name="0GSCS", skiprows=5) # Read the data from the Excel file, specifying the sheet name and skipping the first 5 rows

# Let's make the column names more readable
df = df.rename(columns={
    'Time period and codes': 'date',
    'Total goods': 'Goods'
}
)

# Let's make the date column a number and drop everything that isn't one
df['date'] = pd.to_numeric(df['date'], errors='coerce')
df = df.dropna(subset=['date'])

# We only care about goods and services. Let's keep those
df = df[['date', 'Goods', 'Services']]

# Almost there! Let's make this data long-form
df = df.melt(id_vars='date', var_name='series', value_name='value')

# And save it
df.to_csv("consumertrendsq22024cpsa1_long.csv", index=False)

df

Unnamed: 0,date,series,value
0,1997.0,Goods,264637
1,1998.0,Goods,278202
2,1999.0,Goods,291682
3,2000.0,Goods,302782
4,2001.0,Goods,317427
5,2002.0,Goods,325896
6,2003.0,Goods,340238
7,2004.0,Goods,351679
8,2005.0,Goods,360258
9,2006.0,Goods,375783
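
The cleaning step above relies on `pd.to_numeric(..., errors='coerce')` to turn anything that is not a number into `NaN`, which `dropna` then removes. Here is a minimal sketch of that idea on a toy table. The non-year labels are made up for illustration and are only an assumption about the kind of clutter the spreadsheet contains. It also shows one optional extra step, casting the years to integers so they print as `1997` rather than `1997.0`.

```python
import pandas as pd

# Illustrative rows: the non-year labels are invented to mimic spreadsheet clutter
raw = pd.DataFrame({
    "date": ["Dataset identifier", "1997", "1998", "1997 Q1"],
    "Goods": [None, 264637, 278202, None],
})

# Anything that cannot be parsed as a number becomes NaN
raw["date"] = pd.to_numeric(raw["date"], errors="coerce")

# Drop the NaN rows, then store the years as integers (1997 rather than 1997.0)
clean = raw.dropna(subset=["date"]).copy()
clean["date"] = clean["date"].astype(int)
print(clean)
```

Only the two rows whose `date` is a plain year survive the cleaning.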




# Introducing Tools: `Requests`

The `Requests` module allows us to fetch resources from the internet, whether these are `CSVs`, `JSONs`, images, `HTML` or anything else. This is particularly important for requesting data from APIs.

It is simple to use. Usually all we need to do is:

1. Make a request with `requests.get` and our target URL. For example, we can request something from our GitHub repos:

    `req = requests.get("https://raw.githubusercontent.com/mclass-user/mclass-user.github.io/main/s2_chart1.json")`

2. Access the fetched data, using `req.json()` for JSON data or `req.text` for most other data. For example, we can see the returned JSON for the chart we just fetched:

    `data = req.json()`
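
The two steps above can be wrapped in a small helper. This is only a sketch: `fetch_json` is a hypothetical name, and the 10-second timeout is an arbitrary choice. It adds `raise_for_status()`, which raises an error on a failed request (for example a 404) instead of letting us try to parse an error page as JSON.

```python
import requests


def fetch_json(url, timeout=10):
    """Fetch `url` and return its body decoded from JSON (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # step 1: make the request
    response.raise_for_status()                    # stop early on 4xx/5xx responses
    return response.json()                         # step 2: access the fetched data
```

Calling `fetch_json` with the chart URL from step 1 should return the same dictionary as `req.json()`.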

In [6]:
req = requests.get("https://raw.githubusercontent.com/mclass-user/mclass-user.github.io/main/s2_chart1.json")


data = req.json()
data

{'$schema': 'https://vega.github.io/schema/vega-lite/v5.json',
 'title': {'text': 'Human Development Index ',
  'subtitle': ["P21 Countries' HDI, most recent year", 'Source: UN']},
 'description': 'A simple bar chart with embedded data.',
 'data': {'values': [{'Country': 'Bangladesh', 'HDI': 0.661},
   {'Country': 'Brazil', 'HDI': 0.754},
   {'Country': 'China', 'HDI': 0.768},
   {'Country': 'DR Congo', 'HDI': 0.479},
   {'Country': 'Egypt', 'HDI': 0.731},
   {'Country': 'Ethiopia', 'HDI': 0.498},
   {'Country': 'Germany', 'HDI': 0.942},
   {'Country': 'India', 'HDI': 0.633},
   {'Country': 'Indonesia', 'HDI': 0.705},
   {'Country': 'Iran', 'HDI': 0.774},
   {'Country': 'Japan', 'HDI': 0.925},
   {'Country': 'Mexico', 'HDI': 0.758},
   {'Country': 'Nigeria', 'HDI': 0.535},
   {'Country': 'Pakistan', 'HDI': 0.544},
   {'Country': 'Philippines', 'HDI': 0.699},
   {'Country': 'Russia', 'HDI': 0.829},
   {'Country': 'Thailand', 'HDI': 0.8},
   {'Country': 'Turkey', 'HDI': 0.838},
   {'Coun


# Transforming Data: Indian Exports and Imports

We need to produce a chart showing Indian exports and imports as a percentage of GDP. To do this, we can use three series from the [Economics Observatory API](https://www.economicsobservatory.com/data-hub):
1. Indian Exports
2. Indian Imports
3. Indian GDP

We can find the API URLs for these on the [Data-Hub](https://www.economicsobservatory.com/data-hub):

<img
style="max-height: 350px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/data-hub.png">
</img>


We find:

1. **Exports**: https://api.economicsobservatory.com/ind/expo
2. **Imports**: https://api.economicsobservatory.com/ind/impo
3. **GDP**: https://api.economicsobservatory.com/ind/gdpa

Let's try requesting the first of these, exports, and see how it's formatted.

In [7]:
exports_req = requests.get("https://api.economicsobservatory.com/ind/expo")
exports_req.json()

{'author': 'Economics Observatory',
 'source': 'CSO',
 'url': 'https://eco-temp-cache.s3.eu-west-2.amazonaws.com/ind/expo.json',
 'data': [{'date': '2011-04-01', 'value': 493074.88},
  {'date': '2011-07-01', 'value': 503937.31},
  {'date': '2011-10-01', 'value': 557993.27},
  {'date': '2012-01-01', 'value': 588925.54},
  {'date': '2012-04-01', 'value': 555655.03},
  {'date': '2012-07-01', 'value': 694342.2},
  {'date': '2012-10-01', 'value': 566043.39},
  {'date': '2013-01-01', 'value': 623666.38},
  {'date': '2013-04-01', 'value': 616226.48},
  {'date': '2013-07-01', 'value': 747412.79},
  {'date': '2013-10-01', 'value': 726997.88},
  {'date': '2014-01-01', 'value': 766144.14},
  {'date': '2014-04-01', 'value': 706895.73},
  {'date': '2014-07-01', 'value': 727107.86},
  {'date': '2014-10-01', 'value': 732620.04},
  {'date': '2015-01-01', 'value': 697012.65},
  {'date': '2015-04-01', 'value': 671293.85},
  {'date': '2015-07-01', 'value': 687100.61},
  {'date': '2015-10-01', 'value': 67

We just care about the data itself. Let's take this and make a `DataFrame` out of it.

We'll rename the `value` column to `exports` so we can keep track of what this data is when we merge it.

In [8]:
exports_response = exports_req.json()
exports_data = exports_response['data']
exports_df = pd.DataFrame(exports_data)
exports_df = exports_df.rename(columns={"value": "exports"})
exports_df

Unnamed: 0,date,exports
0,2011-04-01,493074.88
1,2011-07-01,503937.31
2,2011-10-01,557993.27
3,2012-01-01,588925.54
4,2012-04-01,555655.03
5,2012-07-01,694342.2
6,2012-10-01,566043.39
7,2013-01-01,623666.38
8,2013-04-01,616226.48
9,2013-07-01,747412.79



We now have a Pandas `DataFrame` containing the exports data. Let's do the same for imports and GDP.

In [9]:
imports_url = "https://api.economicsobservatory.com/ind/impo"
imports_req = requests.get(imports_url) # This is the request
imports_response = imports_req.json() # This is the response in JSON format
imports_data = imports_response['data'] # We only want the data part of the response, not the metadata
imports_df = pd.DataFrame(imports_data) # We convert the data to a DataFrame
imports_df = imports_df.rename(columns={"value": "imports"}) # We rename the value column to imports

gdp_url = "https://api.economicsobservatory.com/ind/gdpa"
gdp_req = requests.get(gdp_url)
gdp_response = gdp_req.json()
gdp_data = gdp_response['data']
gdp_df = pd.DataFrame(gdp_data)
gdp_df = gdp_df.rename(columns={"value": "gdp"})
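
Since the same steps are repeated for each series, they could be gathered into one function. The sketch below uses a hypothetical helper name, `series_to_df`, and runs on a small hand-written payload shaped like the exports response shown earlier, so it does not need a network connection.

```python
import pandas as pd


def series_to_df(response, name):
    """Turn an API-style response dict into a DataFrame with `date` and `name` columns."""
    df = pd.DataFrame(response["data"])         # keep only the data, not the metadata
    return df.rename(columns={"value": name})   # label the value column with the series name


# A small payload shaped like the exports response (first two observations only)
sample_response = {
    "author": "Economics Observatory",
    "source": "CSO",
    "data": [
        {"date": "2011-04-01", "value": 493074.88},
        {"date": "2011-07-01", "value": 503937.31},
    ],
}

sample_df = series_to_df(sample_response, "exports")
print(sample_df)
```

With a helper like this, each series needs one line, for example `series_to_df(gdp_req.json(), "gdp")`.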


Now that we have all three series downloaded, we can start merging and transforming the data. 

Let's merge our `gdp_df` with the `exports_df` and `imports_df` so we can express them as a ratio.

In [11]:
exports_ratio_df = exports_df.merge(gdp_df, on="date")
exports_ratio_df.tail(5)

Unnamed: 0,date,exports,gdp
40,2021-04-01,1128349.43,5627315.0
41,2021-07-01,1225739.2,6268323.0
42,2021-10-01,1312634.89,6523591.0
43,2022-01-01,1397161.93,6499562.0
44,2022-04-01,1463323.34,6468804.0
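
By default, `merge` performs an inner join: only dates present in both tables are kept. The sketch below uses made-up values to show this, and then uses `how="outer"` with `indicator=True` to reveal which rows an inner join would silently drop.

```python
import pandas as pd

# Made-up values: the two tables overlap on only two of their dates
left = pd.DataFrame({
    "date": ["2021-01-01", "2021-04-01", "2021-07-01"],
    "exports": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "date": ["2021-04-01", "2021-07-01", "2021-10-01"],
    "gdp": [10.0, 20.0, 30.0],
})

# Default (inner) join: keeps only dates found in both tables
inner = left.merge(right, on="date")

# Outer join with an indicator column: shows where each row came from
outer = left.merge(right, on="date", how="outer", indicator=True)

print(inner)
print(outer)
```

The `_merge` column in `outer` marks each row as `left_only`, `right_only` or `both`, which is a quick way to check that two series cover the same period before dividing one by the other.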


Let's now divide the `exports` column by the `gdp` column to get the ratio, keeping only the result and the `date`.

In [30]:
exports_ratio_df['Exports Ratio'] = exports_ratio_df['exports'] / exports_ratio_df['gdp'] # Divide exports by GDP to get the ratio
exports_ratio_df = exports_ratio_df[['date', 'Exports Ratio']] # Select only the date and Exports Ratio columns
exports_ratio_df.head(5)

Unnamed: 0,date,Exports Ratio
0,2011-04-01,0.242958
1,2011-07-01,0.224486
2,2011-10-01,0.230718
3,2012-01-01,0.254593
4,2012-04-01,0.235751
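
The ratio is a fraction of GDP (for example `0.24`), not a percentage. If the chart should show percentages directly, one option is to multiply by 100. This is a sketch using the last row of the merged exports and GDP table above. The column name `Exports (% of GDP)` is an assumption for illustration, not something the rest of the notebook uses.

```python
import pandas as pd

# Last row of the merged exports/GDP table
row = pd.DataFrame({
    "date": ["2022-04-01"],
    "exports": [1463323.34],
    "gdp": [6468804.0],
})

# Fraction of GDP, as computed in the notebook
row["Exports Ratio"] = row["exports"] / row["gdp"]

# The same figure as a percentage, rounded to two decimal places
row["Exports (% of GDP)"] = (row["Exports Ratio"] * 100).round(2)
print(row[["date", "Exports Ratio", "Exports (% of GDP)"]])
```

Alternatively, the fraction can be kept as it is and formatted as a percentage on the chart axis.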


We'll do the same for `imports_df`:

In [31]:
imports_ratio_df = imports_df.merge(gdp_df, on="date") # Merge imports and GDP DataFrames on the date column
imports_ratio_df['Imports Ratio'] = imports_ratio_df['imports'] / imports_ratio_df['gdp'] # Divide imports by GDP to get the ratio
imports_ratio_df = imports_ratio_df[['date', 'Imports Ratio']] # Select only the date and Imports Ratio columns
imports_ratio_df.head(5)

Unnamed: 0,date,Imports Ratio
0,2011-04-01,0.308571
1,2011-07-01,0.283714
2,2011-10-01,0.29763
3,2012-01-01,0.316704
4,2012-04-01,0.303726



We now have all the data we need in two dataframes: `exports_ratio_df` and `imports_ratio_df`.

To use these in a graph, we'll need them in one dataframe.

In [32]:
trade_df = pd.merge(exports_ratio_df, imports_ratio_df, on="date") # Merge the exports and imports DataFrames on the date column
trade_df.head(5)

Unnamed: 0,date,Exports Ratio,Imports Ratio
0,2011-04-01,0.242958,0.308571
1,2011-07-01,0.224486,0.283714
2,2011-10-01,0.230718,0.29763
3,2012-01-01,0.254593,0.316704
4,2012-04-01,0.235751,0.303726


It's best practice for `Vega-lite` to keep data in a tidy (long-form) format, with one column that names the series and one column that holds the values. To transform our data, we'll use the Pandas function `melt`.

In [33]:
# Melt the DataFrame so the exports and imports ratios share one "value" column,
# keeping "date" as the identifier and recording the original column name in "series"
trade_df = trade_df.melt(id_vars="date", var_name="series", value_name="value")
trade_df

Unnamed: 0,date,series,value
0,2011-04-01,Exports Ratio,0.242958
1,2011-07-01,Exports Ratio,0.224486
2,2011-10-01,Exports Ratio,0.230718
3,2012-01-01,Exports Ratio,0.254593
4,2012-04-01,Exports Ratio,0.235751
...,...,...,...
85,2021-04-01,Imports Ratio,0.205880
86,2021-07-01,Imports Ratio,0.216721
87,2021-10-01,Imports Ratio,0.237311
88,2022-01-01,Imports Ratio,0.244439
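
To see what `melt` does on a table small enough to read at a glance, here is a sketch using the first two rows of the merged table. It also pivots the long table back to wide form, which is a simple way to confirm that no information was lost.

```python
import pandas as pd

# First two rows of the wide table: one column per series
wide = pd.DataFrame({
    "date": ["2011-04-01", "2011-07-01"],
    "Exports Ratio": [0.242958, 0.224486],
    "Imports Ratio": [0.308571, 0.283714],
})

# Wide to long: one row per (date, series) pair
long = wide.melt(id_vars="date", var_name="series", value_name="value")
print(long)

# Long back to wide: each series becomes a column again
back = long.pivot(index="date", columns="series", values="value").reset_index()
print(back)
```

Two dates and two series give four rows in the long table, one for each combination.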



We can now export this data to use in `Vega-lite`.

In [34]:
trade_df.to_csv("s5_indian_trade_data.csv", index=False)
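
A quick sanity check is to read the file back in and confirm that it has the columns we expect. The sketch below does this on a small sample written to a temporary folder; the file name `sample_trade_data.csv` is just a placeholder. Because we pass `index=False`, the row index is not written, so no `Unnamed: 0` column appears when the file is loaded again.

```python
import os
import tempfile

import pandas as pd

# A small sample in the same long format as trade_df
sample = pd.DataFrame({
    "date": ["2011-04-01", "2011-04-01"],
    "series": ["Exports Ratio", "Imports Ratio"],
    "value": [0.242958, 0.308571],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_trade_data.csv")  # placeholder file name
    sample.to_csv(path, index=False)                   # index=False: do not write the row index
    loaded = pd.read_csv(path)                         # read it back while the folder still exists

print(loaded)
```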

Finally, to get the CSV onto GitHub from Colab, we just have to go to the Files tab in the sidebar and find our file:

<img
style="max-height: 350px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/colab_download.png">
</img>