# Coding for Economists - Advanced Session 1

## 1. Setup Environment

In [None]:
%pip install selenium beautifulsoup4 pandas

In [1]:
import pandas as pd
import re
import time

## 2. HTTP Requests

### 2.1 Request HTML from URL

In [4]:
from urllib.request import urlopen

url = 'https://www.ft.com/'
page = urlopen(url)
type(page)

http.client.HTTPResponse

In [5]:
page_bytes = page.read()
type(page_bytes)

bytes

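__Why Use Bytes__:
1. Everything sent over a network or stored on disk is bytes (0s and 1s); text is bytes plus an encoding
2. Bytes can be read, written, cached or streamed without being interpreted
3. Large files can be sent in chunks, and interrupted downloads can be resumed
4. Decoding can wait until the encoding is known, which avoids encoding errors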
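
The points above can be tried offline with a minimal sketch. It assumes an in-memory `io.BytesIO` stream as a stand-in for a network response; the sample text and the 16-byte chunk size are arbitrary choices for illustration.

```python
import io

# Sample text with two non-ASCII characters, encoded as UTF-8 bytes.
text = "Café prices rose 2% in Zürich"
payload = text.encode("utf-8")

# Characters and bytes are counted differently: 'é' and 'ü' take two bytes each.
print(len(text), len(payload))

# Read the stream in fixed-size chunks without interpreting the content.
stream = io.BytesIO(payload)   # stand-in for a network response (assumption)
chunks = []
while True:
    chunk = stream.read(16)    # arbitrary chunk size
    if not chunk:
        break
    chunks.append(chunk)

# Decode once, after joining the chunks. Decoding a single chunk could fail
# if a multi-byte character happened to be split across two chunks.
restored = b"".join(chunks).decode("utf-8")
print(restored == text)
```

Joining before decoding is the safer order whenever the chunk boundaries are not under your control.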

In [6]:
page_bytes[:1000]

b'<!DOCTYPE html><html lang="en-GB" class="no-js core o-typography--loading-sans o-typography--loading-sans-bold o-typography--loading-display o-typography--loading-display-bold" data-o-component="o-typography" style="overflow-x:hidden;background-color:#fff1e5;color:#33302e"><head><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1"/><title>Financial Times</title><meta name="description" content="News, analysis and opinion from the Financial Times on the latest in markets, economics and politics"/><meta name="robots" content="index,follow,max-snippet:200,max-image-preview:large"/><meta name="google-site-verification" content="4-t8sFaPvpO5FH_Gnw1dkM28CQepjzo8UjjAkdDflTw"/><script type="application/ld+json">{"@context":"http://schema.org","@type":"WebSite","name":"Financial Times","alternateName":"FT.com","url":"https://www.ft.com/"}</script><meta property="fb:pages" content="8860325749"/><meta pr

In [7]:
# Decode the bytes object into a str containing the HTML
html = page_bytes.decode('utf-8')
type(html)

str

In [8]:
html[:1000]

'<!DOCTYPE html><html lang="en-GB" class="no-js core o-typography--loading-sans o-typography--loading-sans-bold o-typography--loading-display o-typography--loading-display-bold" data-o-component="o-typography" style="overflow-x:hidden;background-color:#fff1e5;color:#33302e"><head><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1"/><title>Financial Times</title><meta name="description" content="News, analysis and opinion from the Financial Times on the latest in markets, economics and politics"/><meta name="robots" content="index,follow,max-snippet:200,max-image-preview:large"/><meta name="google-site-verification" content="4-t8sFaPvpO5FH_Gnw1dkM28CQepjzo8UjjAkdDflTw"/><script type="application/ld+json">{"@context":"http://schema.org","@type":"WebSite","name":"Financial Times","alternateName":"FT.com","url":"https://www.ft.com/"}</script><meta property="fb:pages" content="8860325749"/><meta pro

### 2.2 HTML (HyperText Markup Language) Structure

__HTML Tutorial__: https://www.w3schools.com/html/html_intro.asp

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Simple HTML Example</title>
</head>
<body>

  <h3>My Simple Page</h3>

  <p>This is a simple paragraph of text on my page.</p>

  <img src="ecb.avif" alt="Placeholder image">

  <table>
    <tr>
      <th>Name</th>
      <th>Age</th>
    </tr>
    <tr>
      <td>Alice</td>
      <td>30</td>
    </tr>
    <tr>
      <td>Bob</td>
      <td>25</td>
    </tr>
  </table>

</body>
</html>

In [9]:
len(html)

362708

## 3. Parse HTML Using `BeautifulSoup`
`BeautifulSoup` transforms a complex HTML document into a tree of Python objects.

__BeautifulSoup Tutorial__: https://beautiful-soup-4.readthedocs.io/en/latest/#
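
To see what "a tree of Python objects" means before working on a real page, here is a small sketch that parses a trimmed copy of the sample page from Section 2.2. The variable names (`sample_html`, `sample_soup`) are chosen here for illustration and do not appear elsewhere in the session.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample page from Section 2.2.
sample_html = """
<html>
<head><title>Simple HTML Example</title></head>
<body>
  <h3>My Simple Page</h3>
  <p>This is a simple paragraph of text on my page.</p>
  <table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Alice</td><td>30</td></tr>
    <tr><td>Bob</td><td>25</td></tr>
  </table>
</body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Each tag becomes a Python object that can be reached by attribute access.
page_title = sample_soup.title.get_text()
heading = sample_soup.h3.get_text()

# Tags know their position in the tree: the <h3> sits directly inside <body>.
parent_name = sample_soup.h3.parent.name

# Nested tags can be walked level by level: rows first, then cells.
table_rows = [
    [cell.get_text() for cell in row.find_all(["th", "td"])]
    for row in sample_soup.table.find_all("tr")
]
print(page_title, heading, parent_name)
print(table_rows)
```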

### 3.1 Make the Soup

In [10]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [13]:
soup.title

<title>Financial Times</title>

In [14]:
soup.img

<img alt="US and Ukraine sign natural resources deal " class="image image--width-280" src="https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F93044edb-7182-4013-b2d5-e42f5047fbdd.jpg?source=next-home-page&amp;dpr=2&amp;width=280&amp;fit=scale-down"/>

In [15]:
soup.p

<p class="standfirst"><a aria-hidden="false" class="link" data-trackable="standfirst-link" data-trackable-context-story-link="standfirst-link" href="/content/1ae70f6c-6651-46e4-bc14-cfd6befc1474" target="_self"><span class="text text--color-black-60 text-sans--scale-0 text--style--no-active-state" id="">Agreement establishes a ‘reconstruction investment fund’ after weeks of fraught negotiations</span></a></p>

### 3.2 Search the Soup `.find_all()`

In [16]:
unique_tags = { tag.name for tag in soup.find_all(True) }
print(unique_tags)

{'br', 'html', 'small', 'picture', 'h3', 'button', 'time', 'svg', 'source', 'abbr', 'p', 'ol', 'path', 'section', 'a', 'label', 'head', 'iframe', 'footer', 'main', 'h2', 'header', 'ul', 'link', 'script', 'meta', 'span', 'h1', 'body', 'li', 'input', 'img', 'pg-slot', 'title', 'nav', 'noscript', 'form', 'div'}


- __Text Tags__: [`'p'`, `'a'`, `'h1'`, `'h2'`, `'h3'`]
- __Table Tags__: [`'table'`, `'thead'`, `'tbody'`, `'tr'`, `'th'`, `'td'`]
- __Image Tags__: [`'img'`, `'picture'`, `'source'`, `'svg'`]
- __Video Tags__: [`'video'`, `'source'`, `'iframe'`]

#### Find Text from Headers

In [20]:
header_objs = soup.find_all(
    'a',
    class_='o-header__mega-link',
    href=re.compile('content'),
    attrs={'data-trackable': 'link'}
)
headers = [obj.get_text(strip=True) for obj in header_objs]
headers = list(set(headers))
headers[:20]

['What the US stands to lose from a dented dollar',
 'The English town that can’t wait to be a theme park',
 'When does punk protest become hate speech? Irish rappers Kneecap are testing the limits',
 'Goldman’s Waldron says early US trade deals could set investor views on tariffs',
 'The FT’s top tips for a weekend in Washington, DC',
 'Sam Altman’s eyeball-scanning project World makes US debut',
 'Tesla board denies launching search for Musk’s successor',
 'Meet your new investment banker: an AI chatbot',
 'S&P 500 closes higher as investors shrug off disappointing US data',
 'ECB staff say bank promotes wrong people, survey finds',
 'Welcome to ‘peak bloom’ — the country house market where ‘the garden is the clincher’',
 'Microsoft vows to protect European operations from Trump',
 'Norway’s oil fund aims to save $400mn of trading costs using AI',
 'Rise of economics in English schools fails to close subject’s gender gap',
 'A food tour of Flushing, Queens, America’s biggest Chinatow

In [18]:
print(header_objs[0].prettify())

<a class="o-header__mega-link" data-trackable="link" href="/content/1ae70f6c-6651-46e4-bc14-cfd6befc1474">
 US and Ukraine sign natural resources deal
</a>



In [21]:
header_objs = soup.find_all(
    'a',
    class_='o-header__mega-link',
    href=re.compile('content'),
    attrs={'data-trackable': 'link'},
    string=re.compile('Trump')
)
headers = [obj.get_text(strip=True) for obj in header_objs]
headers = list(set(headers))
headers[-20:]

['US economy contracts at 0.3% rate as Trump’s tariffs prompt import surge',
 'Brexit lessons for Trump’s trade war',
 'Microsoft vows to protect European operations from Trump',

#### Find Text from Paragraphs

In [25]:
# Note: this collects the 'h3' section headings; use soup.find_all('p') for paragraph text
para_objs = soup.find_all('h3')
texts = [obj.get_text(strip=True) for obj in para_objs]
texts[:20]

['More Opinion',
 'More Europe News',
 'More highlights',
 'More markets news',
 'More technology',
 'Support',
 'Legal & Privacy',
 'Services',
 'Tools',
 'Community & Events',
 'More from the FT Group']

In [None]:
print(para_objs[0].prettify())

#### Find Images

In [26]:
img_objs = soup.find_all('img')
img_title = [obj.get('alt') for obj in img_objs]
img_title[:20]

['US and Ukraine sign natural resources deal ',
 'Australia: caught between a slowing China and a chaotic US',
 'Brexit lessons for Trump’s trade war',
 '',
 '',
 'Tesla board denies launching search for Musk’s successor',
 'China signals opening for trade talks with US',
 'McDonald’s US sales drop by most since height of pandemic in 2020',
 'Trump Organization strikes Gulf deals ahead of US president’s visit',
 'How Pope Francis failed to close the Vatican’s financial gap',
 'Wartime trauma endures through generations',
 '',
 'KKR reports first quarterly loss since 2022',
 'Goldman’s Waldron says early US trade deals could set investor views on tariffs',
 'Spain and Portugal blackout blamed on solar power dependency',
 'Sam Altman’s eyeball-scanning project World makes US debut',
 'Trump’s bombing campaign against Houthis tests his vow to ‘stop wars’',
 'Franklin Templeton to list $1.7bn of Uzbekistan state assets  ',
 'Norway’s oil fund targets $400mn trading cost savings using AI',


In [27]:
img_url = [obj.get('src') for obj in img_objs]
img_url[:20]

['https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F93044edb-7182-4013-b2d5-e42f5047fbdd.jpg?source=next-home-page&dpr=2&width=280&fit=scale-down',
 'https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F70ed9a7d-a3c1-4910-bf15-60295e9f76cb.jpg?source=next-home-page&dpr=2&width=580&fit=scale-down',
 'https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F2e233517-abfd-4c1c-b7bf-4aa36a61c9aa.jpg?source=next-home-page&dpr=2&width=180&fit=scale-down',
 'https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2Fuploaded-files%2FByline_ChrisGiles_v2-62a1bb26-c65a-417f-a325-d881ad51c48d.png?source=next-home-page&dpr=2&width=40&height=40&fit=cover&gravity=poi',
 'https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cl

In [28]:
print(img_objs[0].prettify())

<img alt="US and Ukraine sign natural resources deal " class="image image--width-280" src="https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F93044edb-7182-4013-b2d5-e42f5047fbdd.jpg?source=next-home-page&amp;dpr=2&amp;width=280&amp;fit=scale-down"/>



## 4. Control Webpage Using `selenium`
__Selenium Tutorial__: https://www.selenium.dev/documentation/webdriver/getting_started/first_script/

### 4.1 Connect Website
#### Initiate Browser

In [29]:
import selenium.webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = selenium.webdriver.Chrome()
driver.get('http://www.ft.com')

#### Accept Cookies

In [30]:
iframes = driver.find_elements(By.TAG_NAME, 'iframe')
print(f"Found {len(iframes)} iframes.")

for idx, iframe in enumerate(iframes):
    print(f"Iframe {idx}: {iframe.get_attribute('outerHTML')[:200]}...")

Found 3 iframes.
Iframe 0: <iframe name="__tcfapiLocator" title="__tcfapiLocator" style="display: none;"></iframe>...
Iframe 1: <iframe name="__gppLocator" style="display: none;"></iframe>...
Iframe 2: <iframe src="https://consent-manager.ft.com/index.html?hasCsp=true&amp;message_id=1274423&amp;consentUUID=null&amp;consent_origin=https%3A%2F%2Fconsent-manager.ft.com%2Fconsent%2Ftcfv2&amp;preload_mes...


In [31]:
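# The consent banner is the third iframe (index 2 in the output above)
driver.switch_to.frame(iframes[2])
accept_btn = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept Cookies')]"))
)
accept_btn.click()
# Return to the main document so later lookups search the page, not the iframe
driver.switch_to.default_content()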

#### Scroll Down to Show the Entire Page

In [None]:
# def ScrollPage(ScrollNumber = 5, ScrollSleep = 1):
#     for i in range(1,ScrollNumber):
#         driver.execute_script("window.scrollTo(1,50000)")
#         time.sleep(ScrollSleep)

# ScrollPage()

### 4.2 Navigate the Page

#### Click on A Link

In [32]:
# List up to 20 headlines that mention 'Trump'
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
header_objs = soup.find_all(
    'a',
    class_='o-header__mega-link',
    href=re.compile('content'),
    attrs={"data-trackable": "link"},
    string=re.compile('Trump')
)
headers = [obj.get_text(strip=True) for obj in header_objs]
headers[:20]

['Brexit lessons for Trump’s trade war',
 'US economy contracts at 0.3% rate as Trump’s tariffs prompt import surge',
 'Brexit lessons for Trump’s trade war',
 'US economy contracts at 0.3% rate as Trump’s tariffs prompt import surge',
 'Microsoft vows to protect European operations from Trump',
 'Brexit lessons for Trump’s trade war',

In [35]:
headers[1]

'US economy contracts at 0.3% rate as Trump’s tariffs prompt import surge'

In [36]:
# Click on the second headline (index 1)
link = driver.find_element(By.LINK_TEXT, headers[1])
link.click()

In [37]:
url_save = driver.current_url

### 4.3 Fill in Forms

In [38]:
# Sign in to an FT.com account
url_login = 'https://accounts.ft.com/login'
driver.get(url_login)

In [39]:
# Enter email address
wait = WebDriverWait(driver, 15)
email_field = wait.until(EC.element_to_be_clickable((By.ID, 'enter-email')))
email_field.clear()
email_field.send_keys('USERNAME')

In [40]:
# Click Next
next_btn = driver.find_element(By.ID, 'enter-email-next')
next_btn.click()

In [None]:
# Enter password
password_field = wait.until(EC.visibility_of_element_located((By.ID, 'enter-password')))
password_field.clear()
password_field.send_keys('PASSWORD')

In [41]:
# Click the sign-in button
next_btn = driver.find_element(By.ID, 'sign-in-button')
next_btn.click()

### 4.4 Scrape Current Page

In [42]:
driver.get(url_save)

In [43]:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
para_objs = soup.find_all('p')
texts = [obj.get_text(strip=True) for obj in para_objs]
texts

['',
 'Claire Jonesin Washington',
 'Publishedyesterday',
 'Updated17:46',
 'The US economy contracted by an annualised 0.3 per cent over the first quarter, as companies in the world’s largest economy responded to Donald Trump’s trade war by rushing to import goods.',
 'The fall in the GDP reading — the first since 2022 — was worse than economists’ most recent forecasts and compared with the 2.4 per cent rise for the fourth quarter.',
 'It was largely the result of companies’ rush to buy goods from abroad ahead ofthe US president’s sweeping tariffs, with imports rising at an annualised rate of 41 per cent.',
 'Many analysts argued that the headline GDP number was principally brought down by an extraordinaryincrease in the US trade deficit, rather than reflecting underlying trends.',
 'The calculation used for Wednesday’s figure arrives at GDP by subtracting imports from total spending, including domestic consumption, investment and exports.',
 'Morgan Stanley economists said the surge 

In [44]:
''.join(texts)

'Claire Jonesin WashingtonPublishedyesterdayUpdated17:46The US economy contracted by an annualised 0.3 per cent over the first quarter, as companies in the world’s largest economy responded to Donald Trump’s trade war by rushing to import goods.The fall in the GDP reading — the first since 2022 — was worse than economists’ most recent forecasts and compared with the 2.4 per cent rise for the fourth quarter.It was largely the result of companies’ rush to buy goods from abroad ahead ofthe US president’s sweeping tariffs, with imports rising at an annualised rate of 41 per cent.Many analysts argued that the headline GDP number was principally brought down by an extraordinaryincrease in the US trade deficit, rather than reflecting underlying trends.The calculation used for Wednesday’s figure arrives at GDP by subtracting imports from total spending, including domestic consumption, investment and exports.Morgan Stanley economists said the surge of imports ultimately contributed to inventori

### 4.5 Polite Requests

In [None]:
# Human-like Pauses
import random, time

def human_pause(mean=1.5, std=0.5):
    time.sleep(max(0, random.gauss(mean, std)))

# after each navigation or click…
human_pause()

In [None]:
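# Use exponential back-off on failures
from selenium.common.exceptions import TimeoutException

backoff = 1  # seconds to wait before the first retry
while True:
    try:
        driver.get(url)
        break
    except TimeoutException:
        time.sleep(backoff)
        backoff = min(backoff * 2, 30)  # double the wait, capped at 30 seconds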
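
The loop above keeps retrying until the page loads. A possible variant, sketched below with only the standard library, gives up after a fixed number of attempts. The helper name `retry_with_backoff` and its parameters are hypothetical, and the `flaky` function is a stand-in for a call such as `driver.get(url)`.

```python
import time

def retry_with_backoff(action, max_tries=5, first_wait=1.0, max_wait=30.0,
                       sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    wait = first_wait
    for attempt in range(1, max_tries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_tries:
                raise                     # out of attempts: re-raise the last error
            sleep(wait)
            wait = min(wait * 2, max_wait)

# Demo with a fake action that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []                                # record the waits instead of sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a design choice that lets the demo run instantly; in real use the default `time.sleep` applies. Catching a narrower exception type than `Exception` would be preferable when the expected failure is known.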

In [None]:
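# Identify yourself
opts = selenium.webdriver.ChromeOptions()
opts.add_argument('user-agent=MyBot/1.0 (+https://mydomain.com/bot-info)')
# Pass options=opts to selenium.webdriver.Chrome(...) when starting the browser for this to take effect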

In [45]:
# Close browser when finished
driver.quit()

## 5. Scrape Online Tables
__Tutorial__: https://oxylabs.io/blog/python-scrape-tables

In [46]:
# Request HTML
from urllib.request import Request, urlopen

url = 'https://www.worldometers.info/world-population/population-by-country/'

# Set a fake browser user agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}

# Build a request with headers
req = Request(url, headers=headers)

page = urlopen(req)
page_bytes = page.read()
html = page_bytes.decode('utf-8')

In [47]:
# Make the soup
soup = BeautifulSoup(html, 'html.parser')

# Find the table
table = soup.find("table")

In [48]:
# Extract headers
headers = []
for th in table.find_all('th'):
    headers.append(th.get_text(strip=True))
print(headers)

['#', 'Country (ordependency)', 'Population(2025)', 'YearlyChange', 'NetChange', 'Density(P/Km²)', 'Land Area(Km²)', 'Migrants(net)', 'Fert.Rate', 'MedianAge', 'UrbanPop %', 'WorldShare']


In [49]:
# Extract all rows
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    row = [cell.get_text(strip=True) for cell in cells]
    if row:  # only append non-empty rows
        rows.append(row)
print(rows[:2])

[['1', 'India', '1,463,865,525', '0.89%', '12,929,734', '492', '2,973,190', '−495,753', '1.94', '28.8', '37.1%', '17.78%'], ['2', 'China', '1,416,096,094', '−0.23%', '−3,225,184', '151', '9,388,211', '−268,126', '1.02', '40.1', '67.5%', '17.20%']]


In [50]:
# Build pandas DataFrame
df = pd.DataFrame(rows, columns=headers)
df.head()

|   | # | Country (ordependency) | Population(2025) | YearlyChange | NetChange | Density(P/Km²) | Land Area(Km²) | Migrants(net) | Fert.Rate | MedianAge | UrbanPop % | WorldShare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 1463865525 | 0.89% | 12929734 | 492 | 2973190 | −495,753 | 1.94 | 28.8 | 37.1% | 17.78% |
| 1 | 2 | China | 1416096094 | −0.23% | −3,225,184 | 151 | 9388211 | −268,126 | 1.02 | 40.1 | 67.5% | 17.20% |
| 2 | 3 | United States | 347275807 | 0.54% | 1849236 | 38 | 9147420 | 1230663 | 1.62 | 38.5 | 82.8% | 4.22% |
| 3 | 4 | Indonesia | 285721236 | 0.79% | 2233305 | 158 | 1811570 | −39,509 | 2.1 | 30.4 | 59.6% | 3.47% |
| 4 | 5 | Pakistan | 255219554 | 1.57% | 3950390 | 331 | 770880 | −1,235,336 | 3.5 | 20.6 | 34.4% | 3.10% |
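
Every cell scraped from the table is a string, so the columns need converting before any arithmetic. The sketch below shows one possible approach; the helper name `to_number` is hypothetical, and it assumes negative values use the Unicode minus sign (U+2212), as the output above suggests.

```python
def to_number(cell):
    """Convert a scraped numeric string to int or float.

    Percentages keep their scale, so '0.89%' becomes 0.89.
    """
    cleaned = (
        cell.replace("\u2212", "-")   # Unicode minus sign -> ASCII hyphen-minus
            .replace(",", "")          # drop thousands separators
            .replace("%", "")          # drop the percent sign
            .strip()
    )
    return float(cleaned) if "." in cleaned else int(cleaned)

# Values copied from the China row; "\u2212" is the Unicode minus sign.
sample_cells = ["1,416,096,094", "\u22120.23%", "\u22123,225,184"]
converted = [to_number(c) for c in sample_cells]
print(converted)
```

With the DataFrame built above, the same helper could be applied column by column, for example `df['Population(2025)'].map(to_number)`.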


## 6. Scrape Multi-page Websites

In [51]:
# Request HTML
from urllib.request import Request, urlopen

url = 'https://scholar.google.com/citations?hl=en&vq=bus_economics&view_op=list_hcore&venue=6OFMzPxOGXUJ.2024'

# Set a fake browser user agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}

# Build a request with headers
req = Request(url, headers=headers)

page = urlopen(req)
page_bytes = page.read()
html = page_bytes.decode('utf-8')

In [52]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv

url = 'https://scholar.google.com/citations?hl=en&vq=bus_economics&view_op=list_hcore&venue=6OFMzPxOGXUJ.2024'
driver = webdriver.Chrome()
driver.get(url)

#### `.find_elements()` CSS Selector Format
- Start with the tag name `TAGNAME` (optional)
- Put a dot before each class name: `.CLASSNAME`
- Chain names that belong to the same element without spaces: `TAGNAME.CLASSNAME1.CLASSNAME2`
- Use a space to reach an element nested at a lower level: `.CLASSNAME TAGNAME_LOWER.CLASSNAME_LOWER`
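
The selector rules above can be tried without opening a browser, because BeautifulSoup's `select()` method accepts the same CSS selector strings. The markup below is invented for illustration; with Selenium, the same strings would be passed to `driver.find_elements(By.CSS_SELECTOR, ...)`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
demo_html = """
<div class="paper featured">
  <a class="title" href="/p/1">Paper One</a>
  <span class="meta year">2020</span>
</div>
<div class="paper">
  <a class="title" href="/p/2">Paper Two</a>
  <span class="meta year">2021</span>
</div>
<p class="meta">Footer note</p>
"""
demo = BeautifulSoup(demo_html, "html.parser")

# .CLASSNAME: any tag that carries the class
all_meta = [tag.name for tag in demo.select(".meta")]

# TAGNAME.CLASSNAME1.CLASSNAME2: one tag that carries both classes
featured = demo.select("div.paper.featured")

# .CLASSNAME TAGNAME_LOWER.CLASSNAME_LOWER: an element nested inside a match
titles = [a.get_text() for a in demo.select(".paper a.title")]
years = [s.get_text() for s in demo.select(".paper span.year")]

print(all_meta, len(featured), titles, years)
```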

In [53]:
all_data = []

while True:
    # Scrape data
    titles = driver.find_elements(By.CSS_SELECTOR, ".gsc_mpat_ttl a")
    info = driver.find_elements(By.CSS_SELECTOR, ".gs_gray")
    info = info[:-1]
    authors = info[0::2]
    journals = info[1::2]
    citations = driver.find_elements(By.CSS_SELECTOR, ".gsc_mpat_c a")
    years = driver.find_elements(By.CSS_SELECTOR, ".gsc_mpat_y span")
    years = years[1:]
    
    for title, author, journal, citation, year in zip(titles, authors, journals, citations, years):
        all_data.append({
            'title': title.text,
            'author': author.text,
            'journal': journal.text,
            'citation': citation.text,
            'year': year.text,
            'link': title.get_attribute('href')
        })
    # Try to go to the next page
    try:
        button = driver.find_element(By.CSS_SELECTOR, "button.gs_btnPR.gs_in_ib.gs_btn_half.gs_btn_lsb.gs_btn_srt.gsc_pgn_pnx")
        if not button.is_enabled():
            print('Reached last page.')
            break
        button.click()
        time.sleep(2)  # Wait for next page to load
    except Exception:  # e.g. the next-page button is not found
        print("No more pages.")
        break

Reached last page.


In [54]:
# create a DataFrame
df = pd.DataFrame(all_data)
print(df.shape)
df.head()

(153, 6)


|   | title | author | journal | citation | year | link |
|---|---|---|---|---|---|---|
| 0 | Two-Way Fixed Effects Estimators with Heteroge... | C de Chaisemartin, X D’Haultfœuille | American Economic Review 110 (9), 2964-2996 | 3639 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 1 | Bartik Instruments: What, When, Why, and How | P Goldsmith-Pinkham, I Sorkin, H Swift | American Economic Review 110 (8), 2586-2624 | 2027 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 2 | Macroeconomic Implications of COVID-19: Can Ne... | V Guerrieri, G Lorenzoni, L Straub, I Werning | American Economic Review 112 (5), 1437-1474 | 1594 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 3 | Importing Political Polarization? The Electora... | D Autor, D Dorn, G Hanson, K Majlesi | American Economic Review 110 (10), 3139-3183 | 1515 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 4 | Are Ideas Getting Harder to Find? | N Bloom, CI Jones, J Van Reenen, M Webb | American Economic Review 110 (4), 1104-1144 | 1318 |  | https://scholar.google.com/scholar?oi=bibs&clu... |


In [55]:
# Save the results and close the browser when finished
df.to_csv('AER.csv', index=False)  # index=False omits the row index column
driver.quit()

## 7. Advanced Data I/O

### 7.1 Batch Read/Write

In [56]:
# Read csv in batches
for i, chunk in enumerate(pd.read_csv('AER.csv', chunksize=20)):
    # chunk is a DataFrame of up to chunksize rows
    print(f"Processing chunk {i}, rows {len(chunk)}")
    # … your processing logic here …
    # e.g. transform, filter, write to DB, etc.

Processing chunk 0, rows 20
Processing chunk 1, rows 20
Processing chunk 2, rows 20
Processing chunk 3, rows 20
Processing chunk 4, rows 20
Processing chunk 5, rows 20
Processing chunk 6, rows 20
Processing chunk 7, rows 13


In [57]:
# Write csv in batches
rows_per_file = 50
output_prefix = 'AER_'

for i in range(0, len(df), rows_per_file):
    chunk = df.iloc[i : i + rows_per_file]
    file_idx = i // rows_per_file 
    filename = f"{output_prefix}{file_idx:02d}.csv"
    chunk.to_csv(filename, index=False)
    print(f"Wrote {len(chunk)} rows to {filename}")

Wrote 50 rows to AER_00.csv
Wrote 50 rows to AER_01.csv
Wrote 50 rows to AER_02.csv
Wrote 3 rows to AER_03.csv


### 7.2 Compression

#### Compression Ratio
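Typical size reductions for text-heavy files such as CSV (actual ratios depend on the data and the compression level):
- `zip`: ~60% smaller
- `gzip`: ~70% smaller
- `7z`: ~75% smaller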
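
Compression ratios vary a lot with the data, so it can help to measure them directly. The sketch below uses only the standard library and synthetic CSV text invented for illustration. It relies on `zlib` (the DEFLATE method behind zip and gzip) and `lzma` (the method 7z uses by default) as stand-ins for the archive formats themselves.

```python
import gzip
import lzma
import zlib

# Synthetic CSV text, invented for illustration: 2,000 similar rows.
csv_lines = [
    f"{i},Author {i % 50},American Economic Review,{1000 + i}\n"
    for i in range(2000)
]
raw = "".join(csv_lines).encode("utf-8")

def reduction(compressed):
    """Share of the original size removed by compression, in percent."""
    return 100 * (1 - len(compressed) / len(raw))

blobs = {
    "zlib (DEFLATE)": zlib.compress(raw),
    "gzip": gzip.compress(raw),
    "lzma": lzma.compress(raw),
}
for name, blob in blobs.items():
    print(f"{name}: {reduction(blob):.1f}% smaller")

# Compression is lossless: decompressing returns the original bytes.
print(gzip.decompress(blobs["gzip"]) == raw)
```

Highly repetitive data like this sample compresses far better than typical real-world files, so the printed figures should not be read as general benchmarks.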

In [58]:
# Compress data into gzip format
df.to_csv('AER.csv.gz', index=False, compression='gzip')

# Compress data into zip format
df.to_csv('AER.csv.zip', index=False, compression='zip')

In [59]:
# Compare sizes
import os

raw_size = os.path.getsize('AER.csv')
gzip_size = os.path.getsize('AER.csv.gz')
zip_size = os.path.getsize('AER.csv.zip')
print(f"Raw file size: {raw_size / (1024**2):.2f} MB")
print(f"Gzip file size: {gzip_size / (1024**2):.2f} MB")
print(f"zip file size: {zip_size / (1024**2):.2f} MB")

Raw file size: 0.03 MB
Gzip file size: 0.01 MB
zip file size: 0.01 MB


In [60]:
# Read compressed file using pandas
df_gz = pd.read_csv('AER.csv.gz', compression='gzip')
df_gz.head()

|   | title | author | journal | citation | year | link |
|---|---|---|---|---|---|---|
| 0 | Two-Way Fixed Effects Estimators with Heteroge... | C de Chaisemartin, X D’Haultfœuille | American Economic Review 110 (9), 2964-2996 | 3639 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 1 | Bartik Instruments: What, When, Why, and How | P Goldsmith-Pinkham, I Sorkin, H Swift | American Economic Review 110 (8), 2586-2624 | 2027 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 2 | Macroeconomic Implications of COVID-19: Can Ne... | V Guerrieri, G Lorenzoni, L Straub, I Werning | American Economic Review 112 (5), 1437-1474 | 1594 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 3 | Importing Political Polarization? The Electora... | D Autor, D Dorn, G Hanson, K Majlesi | American Economic Review 110 (10), 3139-3183 | 1515 |  | https://scholar.google.com/scholar?oi=bibs&clu... |
| 4 | Are Ideas Getting Harder to Find? | N Bloom, CI Jones, J Van Reenen, M Webb | American Economic Review 110 (4), 1104-1144 | 1318 |  | https://scholar.google.com/scholar?oi=bibs&clu... |


### 7.3 Pickle

In [61]:
df.dtypes

title       object
author      object
journal     object
citation    object
year        object
link        object
dtype: object

In [62]:
df_gz.dtypes

title        object
author       object
journal      object
citation      int64
year        float64
link         object
dtype: object

In [63]:
df.equals(df_gz)

False

In [64]:
df.to_pickle('AER.pkl')
df_pickle = pd.read_pickle('AER.pkl')
df.equals(df_pickle)

True
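
The comparison above comes down to how each format stores values: CSV keeps only text, so types must be inferred again on reading (here `citation` came back as `int64` and `year` as `float64`), while pickle stores the Python objects together with their types. A minimal standard-library sketch of the same effect, using made-up records:

```python
import csv
import io
import pickle

# Made-up records for illustration; 'year' is missing in the first one.
records = [
    {"title": "Paper A", "citation": 3639, "year": None},
    {"title": "Paper B", "citation": 2027, "year": 2020},
]

# CSV round trip: every value is written as text, so the types are lost.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "citation", "year"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# Pickle round trip: the objects are serialised together with their types.
from_pickle = pickle.loads(pickle.dumps(records))

print(from_csv[0])
print(from_csv == records, from_pickle == records)
```

Pickle files can run arbitrary code when loaded, so only unpickle files from sources you trust.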