<a href="https://colab.research.google.com/github/guilhermelaviola/IntegratingPracticeInDataScienceForBusiness/blob/main/DataCollectionCleaningAndProcessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Collection, Cleaning and Processing**

# **Part 1: Web Scraping**

**Objectives**
- Collect data from a website using web scraping in Python.
- Identify and extract specific information, such as article titles or product prices.

**Step by Step**
- Choose a website that allows web scraping (check the website's terms of use).
- Import the requests and BeautifulSoup libraries.
- Set the website's URL and make the HTTP request.
- Analyze the page's HTML content and extract the desired information.
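The first step mentions checking whether a site allows scraping. Terms of use have to be read by a person, but many sites also publish a machine-readable `robots.txt` file. The sketch below shows how the standard library's `urllib.robotparser` can evaluate such rules. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are hypothetical, and a passing check does not replace reading the terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, written inline for illustration only.
sample_rules = "User-agent: *\nDisallow: /checkout/\n"

print(is_allowed(sample_rules, "https://example.com/match-kit/"))
print(is_allowed(sample_rules, "https://example.com/checkout/cart"))
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would download the live file instead of using inline text.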

In [25]:
import requests
from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

# Getting current date in dd-mm-yyyy format:
extraction_date = datetime.now().strftime('%d-%m-%Y')

url = 'https://store.inter.it/en-br/match-kit/'
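response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page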
soup = BeautifulSoup(response.content, 'html.parser')
products = soup.find_all(class_='tile-body')

data = []

for product in products:
    text = product.get_text(strip=True)

    # Extracting using regex: product name, price, sizes:
    match = re.match(r"^(.*?)(?:From)?R\$ ?([\d.,]+)([A-Z]+)$", text)

    if match:
        product_name = match.group(1).strip()
        price = match.group(2).strip()
        sizes = ' '.join(match.group(3))  # Puts a space between every letter, so a size like XL becomes 'X L'
        data.append([product_name, price, sizes, extraction_date])
    else:
        data.append([text, '', '', extraction_date])  # fallback in case of mismatch

# Writing to .CSV:
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name', 'Price (BRL)', 'Sizes', 'Date'])
    writer.writerows(data)
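The regular expression in the cell above can be tried on a sample string without calling the website. The sketch below reuses the same pattern and adds a second, assumed pattern (`SIZE_PATTERN`) that keeps multi-letter sizes such as XL together. The sample tile text, the helper name `parse_tile`, and the list of size codes are hypothetical; the real page text may differ.

```python
import re

# Same pattern as in the scraping cell: name, optional "From", price, size run.
TILE_PATTERN = re.compile(r"^(.*?)(?:From)?R\$ ?([\d.,]+)([A-Z]+)$")

# Assumption: sizes are S, M, L, optionally prefixed by one or more X.
SIZE_PATTERN = re.compile(r"X*S|X*L|M")


def parse_tile(text):
    """Split one tile's text into (name, price, sizes), or None on mismatch."""
    match = TILE_PATTERN.match(text)
    if match is None:
        return None
    name, price, size_run = match.groups()
    return name.strip(), price, SIZE_PATTERN.findall(size_run)


# Hypothetical tile text in the shape the pattern expects.
sample = "Home Jersey 2024FromR$ 1.299,90SMLXLXXL"
print(parse_tile(sample))
```

Returning `None` on a mismatch lets the caller decide on a fallback, much like the `else` branch in the scraping loop.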

# **Part 2: Data Cleaning with OpenRefine**

**Objectives**
- Import data collected via web scraping into OpenRefine.
- Perform basic cleaning operations, such as removing duplicates and standardizing formats.

**Step by Step**
- Import data into OpenRefine in CSV format.
- Use OpenRefine's features to:
  - Remove duplicate records.
  - Standardize date or numeric formats.
  - Correct inconsistencies in data, such as typos.
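OpenRefine finds many inconsistent spellings through its clustering feature, one of whose methods is a key-collision "fingerprint". The Python sketch below approximates that idea to show why values such as `Home Jersey` and `jersey, home` end up in the same group. It is an illustration rather than OpenRefine's own implementation, and the function names and sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: lowercase, ASCII, no punctuation, sorted unique words."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^\w\s]", "", text)                    # drop punctuation
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical product names with inconsistent case, spacing, and word order.
names = ["Home Jersey", "home  jersey", "Jersey, Home", "Away Jersey"]
print(cluster(names))
```

A key of this kind catches differences in case, spacing, punctuation, and word order. Genuine misspellings need a distance-based method instead.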

# **Part 3: Data Processing with Python**

**Objective**
- Use the pandas library to clean and transform the collected data.

**Step by Step**
- Import the pandas library and load the data.
- Replace null values, remove duplicates, and standardize date formats.
- Filter the data to include only relevant records.
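The steps above can be sketched on a few hypothetical in-memory rows, which makes each transformation easy to verify before running it on the scraped file. The sample rows are invented for illustration. The column names and the `dd-mm-yyyy` date format are assumed to match the CSV written in Part 1.

```python
import io

import pandas as pd

# Hypothetical rows in the same layout as data.csv (prices in Brazilian format).
sample_csv = io.StringIO(
    "Product Name,Price (BRL),Sizes,Date\n"
    'Home Jersey,"1.299,90",S M L,05-03-2025\n'
    'Home Jersey,"1.299,90",S M L,05-03-2025\n'
    'Training Top,"499,00",M L,05-03-2025\n'
    "Scarf,,,05-03-2025\n"
)

# Read prices as text so '1.299,90' is not misinterpreted by the parser.
df = pd.read_csv(sample_csv, dtype={"Price (BRL)": str})
df = df.drop_duplicates()

# Parse dates written as dd-mm-yyyy; invalid values become NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")

# Remove the thousands separator, turn the decimal comma into a point,
# convert to numbers, then replace missing prices with 0.
df["Price (BRL)"] = pd.to_numeric(
    df["Price (BRL)"].str.replace(".", "", regex=False).str.replace(",", ".", regex=False),
    errors="coerce",
).fillna(0)

expensive = df[df["Price (BRL)"] > 800]
print(expensive)
```

Converting the price to a number before filling nulls keeps the column numeric throughout, so the final comparison behaves predictably.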

In [30]:
# Importing Pandas
import pandas as pd

# Loading data from the .CSV file, keeping prices as text so that
# Brazilian-formatted values such as '1.299,90' are not misparsed:
data = pd.read_csv('data.csv', dtype={'Price (BRL)': str})

# Removing duplicates:
data.drop_duplicates(inplace=True)

# Parsing the 'Date' column, which the scraper wrote as '%d-%m-%Y';
# errors='coerce' sets values that cannot be parsed to NaT (Not a Time):
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y', errors='coerce')

# Converting 'Price (BRL)' to numeric: remove the thousands separator, then
# turn the decimal comma into a point (regex=False treats '.' literally):
data['Price (BRL)'] = pd.to_numeric(
    data['Price (BRL)'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False),
    errors='coerce')

# Replacing null prices with 0 (plain assignment avoids the chained inplace warning):
data['Price (BRL)'] = data['Price (BRL)'].fillna(0)

# Filtering products with price > 800:
data = data[data['Price (BRL)'] > 800]

# Saving cleaned data:
data.to_csv('inter_products_cleaned.csv', index=False)


# **Part 4: Analysis and Reflection**

**Objectives**
- Reflect on the process of collecting, cleaning, and processing the data.
- Identify the challenges faced and the solutions applied.