---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include in-line prose to describe what is going on.

In [17]:
# Load in necessary packages
import pandas as pd
import numpy as np

We start by examining the dataset from Wave 111 of the American Trends Panel, conducted by the Pew Research Center.

In [None]:
# Read in the raw data (an SPSS .sav file)
W111_df = pd.read_spss("../../data/raw-data/ATP_W111.sav")

# View the first several rows of the data
print(W111_df.head())

    QKEY INTERVIEW_START_W111  INTERVIEW_END_W111 DEVICE_TYPE_W111 LANG_W111  \
0  113.0  2022-07-05 16:28:49 2022-07-06 03:44:16       Smartphone   English   
1  115.0  2022-07-05 16:28:59 2022-07-16 20:39:46       Smartphone   English   
2  116.0  2022-07-05 16:29:59 2022-07-05 16:39:21           Tablet   English   
3  117.0  2022-07-05 16:30:18 2022-07-05 16:36:32        Laptop/PC   English   
4  119.0  2022-07-05 16:30:35 2022-07-05 16:38:35       Smartphone   English   

    XTABLET_W111        SHOP18_W111  \
0  Non-tablet HH               Some   
1  Non-tablet HH               None   
2  Non-tablet HH  All or almost all   
3  Non-tablet HH               Some   
4  Non-tablet HH               None   

                                         SHOP19_W111  \
0  I try to make sure that I always have cash wit...   
1  I don’t really worry much about whether or not...   
2  I don’t really worry much about whether or not...   
3  I don’t really worry much about whether or not...   
4  I

The DataFrame has 6034 rows and 139 columns. With that many columns, some will need to be eliminated, which we do in the data cleaning step.

In [None]:
# Display the DataFrame shape and column titles
print(W111_df.shape)
print(W111_df.columns)

(6034, 139)
Index(['QKEY', 'INTERVIEW_START_W111', 'INTERVIEW_END_W111',
       'DEVICE_TYPE_W111', 'LANG_W111', 'XTABLET_W111', 'SHOP18_W111',
       'SHOP19_W111', 'METOO1_W111', 'METOOSUPOE_M1_W111',
       ...
       'F_PARTYLN_FINAL', 'F_PARTYSUM_FINAL', 'F_PARTYSUMIDEO_FINAL',
       'F_INC_SDT1', 'F_REG', 'F_IDEO', 'F_INTFREQ', 'F_VOLSUM', 'F_INC_TIER2',
       'WEIGHT_W111'],
      dtype='object', length=139)
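
One possible way to start shortlisting columns ahead of the cleaning step is to set aside survey paradata (interview timestamps, device type, language) and keep the question responses. The sketch below runs on a small made-up DataFrame that borrows a few column names from the output above. The `PARADATA_COLUMNS` list and the helper `drop_paradata` are hypothetical and are not part of the project code.

```python
import pandas as pd

# Hypothetical list of paradata columns, using names printed in the output above
PARADATA_COLUMNS = [
    "INTERVIEW_START_W111",
    "INTERVIEW_END_W111",
    "DEVICE_TYPE_W111",
    "LANG_W111",
]


def drop_paradata(df: pd.DataFrame, paradata=PARADATA_COLUMNS) -> pd.DataFrame:
    """Return a copy of df without whichever paradata columns it contains."""
    present = [col for col in paradata if col in df.columns]
    return df.drop(columns=present)


# Toy stand-in for W111_df (made-up values, not the real survey data)
toy_df = pd.DataFrame({
    "QKEY": [1.0, 2.0],
    "DEVICE_TYPE_W111": ["Smartphone", "Tablet"],
    "LANG_W111": ["English", "English"],
    "SHOP18_W111": ["Some", "None"],
    "WEIGHT_W111": [0.8, 1.2],
})

trimmed_df = drop_paradata(toy_df)
print(trimmed_df.columns.tolist())
```

Filtering against the columns that are actually present keeps the helper from failing when a listed column has already been removed.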


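We also find that only 4 of the 139 columns are non-categorical: a respondent key, two interview timestamps, and a survey weight. None of these is a numeric measurement we could model, so we have to continue searching for data suitable for a regression.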

In [25]:
# Display column data types
print(W111_df.dtypes)

# Collect the names of all columns that are not categorical
non_categorical_columns = [
    col for col in W111_df.columns if W111_df[col].dtype != "category"
]

print(non_categorical_columns)

QKEY                           float64
INTERVIEW_START_W111    datetime64[ns]
INTERVIEW_END_W111      datetime64[ns]
DEVICE_TYPE_W111              category
LANG_W111                     category
                             ...      
F_IDEO                        category
F_INTFREQ                     category
F_VOLSUM                      category
F_INC_TIER2                   category
WEIGHT_W111                    float64
Length: 139, dtype: object
['QKEY', 'INTERVIEW_START_W111', 'INTERVIEW_END_W111', 'WEIGHT_W111']
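
The same check can also be written without an explicit loop. The sketch below uses `DataFrame.select_dtypes` on a small stand-in frame; `toy_df` is hypothetical, its first three columns copy a few values from the output above, and the weights are invented.

```python
import pandas as pd

# Toy stand-in for W111_df with one column of each dtype seen above
toy_df = pd.DataFrame({
    "QKEY": [113.0, 115.0, 116.0],
    "INTERVIEW_START_W111": pd.to_datetime(
        ["2022-07-05 16:28:49", "2022-07-05 16:28:59", "2022-07-05 16:29:59"]
    ),
    "DEVICE_TYPE_W111": pd.Categorical(["Smartphone", "Smartphone", "Tablet"]),
    "WEIGHT_W111": [0.8, 1.2, 1.0],  # invented weights
})

# Names of every column whose dtype is not categorical
non_categorical = toy_df.select_dtypes(exclude="category").columns.tolist()
print(non_categorical)

# Number of columns per dtype, a compact summary for wide frames
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On a frame with 139 columns, the per-dtype counts are easier to read than the full `dtypes` listing.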


Then, we import and view the 2023 Consumer Expenditure Survey Data.

In [26]:
# Import the four income data files
income_1_df = pd.read_csv("../../data/raw-data/itii232.csv")
income_2_df = pd.read_csv("../../data/raw-data/itii233.csv")
income_3_df = pd.read_csv("../../data/raw-data/itii234.csv")
income_4_df = pd.read_csv("../../data/raw-data/itii241.csv")

Each of the DataFrames has 8 columns; their shapes are displayed below. Because the DataFrames have identical columns, they can be merged in the data collection stage, and we can then filter for the relevant columns.

In [None]:
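# `head` is referenced without parentheses here, so each print shows the bound
# method's repr, which embeds the whole (truncated) frame along with its shape
print(income_1_df.head)
print(income_2_df.head)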

<bound method NDFrame.head of           NEWID  REFMO  REFYR     UCC  PUBFLAG VALUE_  IMPNUM        VALUE
0       5090604      1   2023  900030        2    NaN       1  3169.833300
1       5090604      1   2023  900030        2    NaN       2  3169.833300
2       5090604      1   2023  900030        2    NaN       3  3169.833300
3       5090604      1   2023  900030        2    NaN       4  3169.833300
4       5090604      1   2023  900030        2    NaN       5  3169.833300
...         ...    ...    ...     ...      ...    ...     ...          ...
330445  5366911      5   2023  980071        2    NaN       1   820.250000
330446  5366911      5   2023  980071        2    NaN       2   250.000000
330447  5366911      5   2023  980071        2    NaN       3   100.000000
330448  5366911      5   2023  980071        2    NaN       4   294.666667
330449  5366911      5   2023  980071        2    NaN       5   160.250000

[330450 rows x 8 columns]>
<bound method NDFrame.head of           NE

In [29]:
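# `head` is referenced without parentheses here, so each print shows the bound
# method's repr, which embeds the whole (truncated) frame along with its shape
print(income_3_df.head)
print(income_4_df.head)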

<bound method NDFrame.head of           NEWID  REFMO  REFYR     UCC  PUBFLAG VALUE_  IMPNUM   VALUE
0       5251754      7   2023  800940        2    NaN       1   382.5
1       5251754      7   2023  800940        2    NaN       2   382.5
2       5251754      7   2023  800940        2    NaN       3   382.5
3       5251754      7   2023  800940        2    NaN       4   382.5
4       5251754      7   2023  800940        2    NaN       5   382.5
...         ...    ...    ...     ...      ...    ...     ...     ...
322315  5573581     11   2023  980071        2    NaN       1  6802.0
322316  5573581     11   2023  980071        2    NaN       2  6802.0
322317  5573581     11   2023  980071        2    NaN       3  6802.0
322318  5573581     11   2023  980071        2    NaN       4  6802.0
322319  5573581     11   2023  980071        2    NaN       5  6802.0

[322320 rows x 8 columns]>
<bound method NDFrame.head of           NEWID  REFMO  REFYR     UCC  PUBFLAG VALUE_  IMPNUM        VAL
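
Since the income files share the same eight columns, combining them could look roughly like the sketch below. It checks that the column lists really match before stacking the rows. The helper `combine_if_same_columns` and the one-row toy frames are hypothetical; the toy rows copy the first row of two of the outputs above.

```python
import pandas as pd


def combine_if_same_columns(frames):
    """Stack DataFrames row-wise, but only if all of them share the same columns."""
    expected = list(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != expected:
            raise ValueError(f"Frame {i} has different columns: {list(frame.columns)}")
    return pd.concat(frames, ignore_index=True)


columns = ["NEWID", "REFMO", "REFYR", "UCC", "PUBFLAG", "VALUE_", "IMPNUM", "VALUE"]

# One-row stand-ins for income_1_df and income_3_df
toy_first = pd.DataFrame(
    [[5090604, 1, 2023, 900030, 2, float("nan"), 1, 3169.8333]], columns=columns
)
toy_third = pd.DataFrame(
    [[5251754, 7, 2023, 800940, 2, float("nan"), 1, 382.5]], columns=columns
)

combined = combine_if_same_columns([toy_first, toy_third])
print(combined.shape)
```

Using `ignore_index=True` gives the combined frame a fresh 0-based index, so row labels from the separate files do not collide.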

Additional datasets are available as well, such as the mtbi files.

## Example

In the following code, we first utilized the requests library to retrieve the HTML content from the Wikipedia page. Afterward, we employed BeautifulSoup to parse the HTML and locate the specific table of interest by using the find function. Once the table was identified, we extracted the relevant data by iterating through its rows, gathering country names and their respective populations. Finally, we used Pandas to store the collected data in a DataFrame, allowing for easy analysis and visualization. The data could also be optionally saved as a CSV file for further use. 


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed

# Step 2: Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the table containing the data (usually the first table for such lists)
table = soup.find('table', {'class': 'wikitable'})

# Step 4: Extract data from the table rows
countries = []
populations = []

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) > 1:
        country = cells[1].text.strip()  # The country name is in the second column
        population = cells[2].text.strip()  # The population is in the third column
        countries.append(country)
        populations.append(population)

# Step 5: Create a DataFrame to store the results
data = pd.DataFrame({
    'Country': countries,
    'Population': populations
})

# Display the scraped data
print(data)

# Optionally save to CSV
data.to_csv('../../data/raw-data/countries_population.csv', index=False)


                                 Country     Population
0                                  World  8,119,000,000
1                                  China  1,409,670,000
2                          1,404,910,000          17.3%
3                          United States    335,893,238
4                              Indonesia    281,603,800
..                                   ...            ...
235                   Niue (New Zealand)          1,681
236                Tokelau (New Zealand)          1,647
237                         Vatican City            764
238  Cocos (Keeling) Islands (Australia)            593
239                Pitcairn Islands (UK)             35

[240 rows x 2 columns]
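
Row 2 of the output above holds a population figure in the `Country` column and a percentage in the `Population` column, which suggests that at least one table row has a different cell layout than the loop assumes. A minimal sketch for flagging such rows before any analysis follows. It uses a hand-typed frame copied from the first four rows of the output, and the helper name `flag_misparsed_rows` is hypothetical.

```python
import pandas as pd


def flag_misparsed_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose Population is not a plain number such as '1,409,670,000'."""
    is_number = df["Population"].str.fullmatch(r"\d{1,3}(,\d{3})*")
    return df[~is_number]


# First four rows of the scraped output, typed in by hand
sample = pd.DataFrame({
    "Country": ["World", "China", "1,404,910,000", "United States"],
    "Population": ["8,119,000,000", "1,409,670,000", "17.3%", "335,893,238"],
})

bad_rows = flag_misparsed_rows(sample)
print(bad_rows)

# Drop the flagged rows, then convert the remaining strings to integers
clean = sample.drop(index=bad_rows.index).copy()
clean["Population"] = (
    clean["Population"].str.replace(",", "", regex=False).astype("int64")
)
print(clean)
```

Dropping flagged rows loses the affected country, so in practice the flagged rows are better treated as a prompt to inspect the table's HTML and adjust the cell indices.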


{{< include closing.qmd >}} 