<a href="https://colab.research.google.com/github/guilhermelaviola/ApplicationOfDataScienceForBusiness/blob/main/Class02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Collection, Cleaning, & Processing**
Data science transforms raw data into actionable insights through a structured process that begins with ethical and effective data collection, followed by careful cleaning and processing. Techniques such as web scraping using Python libraries enable the extraction of valuable information, while tools like Google Analytics and direct data collection methods provide additional rich and contextual data sources. However, collected data often contains inconsistencies and requires preparation before analysis. Tools such as OpenRefine and Python’s Pandas library play a critical role in organizing, standardizing, and correcting data to ensure quality and reliability. By combining these tools and methods, data scientists can ensure data integrity, avoid misleading conclusions, and focus on extracting meaningful insights that support informed decision-making and business success.

## **Web Scraping**

Web scraping is a technique that uses tools such as Python to capture data published on websites. Before scraping, consider the ethical and legal context and confirm that the website allows the practice. Web scraping is particularly useful for collecting unstructured data, such as free text, which can then be analyzed with keyword dictionaries and other tools.

First, we import the necessary libraries: requests to make HTTP requests and BeautifulSoup to parse the HTML. Then, we define the URL of the website from which the data will be extracted, download the page, and print the text of the elements we are interested in.

In [None]:
# Importing all the necessary libraries:
import requests
from bs4 import BeautifulSoup

In [None]:
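# Requesting the page and parsing its HTML:
url = 'https://sempreinter.com/latest-news/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stopping early if the request failed
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
# Retrieving all page elements of the 'main_article' class and printing their text:
headlines = soup.find_all(class_='main_article')

for headline in headlines:
    print(headline.get_text())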

Revealed – Why Fenerbahce Refused To Make A Concrete Offer For Inter Milan &Türkiye Star
July 22, 2025 19:53
Hakan Calhanoglu will stay at Inter Milan next season despite being linked with a move to Fenerbahce. According to Fabrizio Romano via FCInterNews, the Canaries…

Report – No Contact Between Inter Milan & RB Leipzig Over Netherlands Star
July 22, 2025 19:40
Inter Milan and RB Leipzig have not opened talks over Xavi Simons despite the Italian club’s alleged interest in the Dutchman. According to L’Interista via…

Como Manager Explains Decision To Turn Down Inter Milan Approach: “Signed 4-Year Contract For A Reason”
July 22, 2025 17:33
( 6 )
Como’s head coach, Cesc Fabregas, has explained his decision to turn down an approach from Inter Milan earlier this summer. During a recent interview via…

Latest Inter Videos

Agent Of Turkey Superstar Denies Fenerbahce & Galatasaray Talks: “He’ll Stay At Inte
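
The headlines printed above are unstructured text. As a minimal sketch of the keyword-dictionary idea mentioned earlier, the example below counts how many headlines touch each topic. The three headlines are copied from the output above; the `KEYWORDS` dictionary, its topic names, and the `count_topics` helper are assumptions made for illustration only.

```python
import re
from collections import Counter

# Headlines copied from the scraped output above.
headlines = [
    "Revealed – Why Fenerbahce Refused To Make A Concrete Offer For Inter Milan &Türkiye Star",
    "Report – No Contact Between Inter Milan & RB Leipzig Over Netherlands Star",
    "Como Manager Explains Decision To Turn Down Inter Milan Approach: “Signed 4-Year Contract For A Reason”",
]

# Hypothetical keyword dictionary: topic -> lowercase keywords.
KEYWORDS = {
    "transfer_talks": {"offer", "contact", "approach", "talks"},
    "contract": {"contract", "signed"},
    "national_team": {"türkiye", "netherlands"},
}


def count_topics(texts, keyword_dict):
    """Count, for each topic, how many texts contain at least one of its keywords."""
    counts = Counter()
    for text in texts:
        # Lowercasing and splitting into words so matching ignores case and punctuation.
        words = set(re.findall(r"\w+", text.lower()))
        for topic, keywords in keyword_dict.items():
            if words & keywords:
                counts[topic] += 1
    return counts


topic_counts = count_topics(headlines, KEYWORDS)
for topic, count in topic_counts.items():
    print(topic, count)
```

Matching on whole words keeps the sketch simple; a real analysis would likely need a larger dictionary and some handling of word variations.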

## **Practical Activity**
In this activity, we scrape product data from an online store with Python, save it to a .CSV file, and then clean and process it with Pandas.

### **Goals**
- To collect data from a website using web scraping techniques with Python.
- To identify and extract specific information, such as article titles or product prices.

### **Step-by-Step**
- Choose a website that allows web scraping (check the website's terms of use).
- Import the requests and BeautifulSoup libraries.
- Define the website's URL and make the HTTP request.
- Analyze the page's HTML content and extract the desired information.
- Save the extracted data to a .CSV file, then clean and process it with Pandas.
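
For the first step, one possible automated check (in addition to reading the terms of use, not instead of it) is to look at the site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are hypothetical and are used only so that the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/match-kit/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/checkout/cart"))
```

Against a live site, the same parser can download the file itself: call `set_url()` with the address of the site's robots.txt, then `read()`, before calling `can_fetch()`.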

In [1]:
# Importing all the necessary libraries and resources:
import requests
from bs4 import BeautifulSoup
import csv
import re
from datetime import datetime

In [2]:
# Using today's date to create the 'date' column in the format 'dd-mm-yyyy':
extraction_date = datetime.now().strftime('%d-%m-%Y')

# Requesting the match-kit page of the Inter Store (the online store of the Italian club
# Internazionale) and parsing its HTML:
url = 'https://store.inter.it/en-br/match-kit/'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stopping early if the request failed
soup = BeautifulSoup(response.content, 'html.parser')
products = soup.find_all(class_='tile-body')  # Retrieving all page elements of the 'tile-body' class

# Creating an empty list to store the products:
data = []

for product in products:
    text = product.get_text(strip=True)

    # Splitting the text into the columns 'name', 'price', and 'sizes' using regex:
    match = re.match(r'^(.*?)(?:From)?R\$ ?([\d.,]+)([A-Z]+)$', text)

    if match:
        product_name = match.group(1).strip()
        price = match.group(2).strip()
        # Splitting the run of size letters into individual sizes, keeping sizes such as
        # 'XL' and 'XXL' whole instead of separating every letter:
        sizes = ' '.join(re.findall(r'X*[SML]|[A-Z]', match.group(3)))
        data.append([product_name, price, sizes, extraction_date])
    else:
        # Keeping the raw text when the pattern does not match:
        data.append([text, '', '', extraction_date])

# Saving the data to a .CSV file:
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name', 'Price (BRL)', 'Sizes', 'Date'])
    writer.writerows(data)

In [3]:
# Importing Pandas:
import pandas as pd

# Loading data from the .CSV file, keeping the price as text so its separators can be fixed below:
data = pd.read_csv('data.csv', dtype={'Price (BRL)': str})

# Converting 'Price (BRL)' from Brazilian notation (e.g. '1.299,90') to numeric values:
data['Price (BRL)'] = pd.to_numeric(
    data['Price (BRL)'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False),
    errors='coerce'
)

# Replacing null values with 0. The result is assigned back to the column because
# calling fillna(..., inplace=True) on a column selection triggers a pandas warning
# and will stop working in pandas 3.0:
data['Price (BRL)'] = data['Price (BRL)'].fillna(0)

# Removing duplicates:
data = data.drop_duplicates()

# Parsing the 'Date' column, stored as '%d-%m-%Y', into datetime values
# (pandas writes them back to the .CSV file as '%Y-%m-%d'):
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y', errors='coerce')

# Filtering to include only products with a price above R$ 800:
data = data[data['Price (BRL)'] > 800]

# Saving clean data:
data.to_csv('inter_products_cleaned.csv', index=False)


## **Analysis and Reflections**
During data collection, the task wasn't too complicated because the code automatically extracted everything from the page and immediately saved it as a .CSV file. Web scraping simplifies things considerably compared to manual data extraction. The most difficult part was adding a date column, since the page didn't have any attribute that could be saved as a date in the file.

Data cleaning was also done using Python code, and since the dataset was small, it wasn't too difficult. I used regex to split the text extracted from each HTML element into separate columns: product name, price, and sizes. I confess it took me a while to figure out how to do this. It's a very useful technique that can be adapted to almost any e-commerce website.
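
To make the regex step easier to follow, here is a small sketch that applies the pattern from the activity code to sample strings. The sample texts are hypothetical and only imitate the flattened text of a product tile; they are not taken from the real page. The `parse_tile` helper and the size-splitting pattern are also assumptions made for illustration.

```python
import re

# Same pattern as in the activity code: product name, optional "From",
# a price introduced by "R$", and a trailing run of capital size letters.
PRODUCT_PATTERN = re.compile(r"^(.*?)(?:From)?R\$ ?([\d.,]+)([A-Z]+)$")


def parse_tile(text):
    """Split one flattened tile text into (name, price, sizes); return None if it does not match."""
    match = PRODUCT_PATTERN.match(text)
    if match is None:
        return None
    name, price, raw_sizes = match.groups()
    # Assumption: sizes are S, M, or L with optional leading X's (XS, XL, XXL);
    # any other capital letter is kept as a single-letter size.
    sizes = re.findall(r"X*[SML]|[A-Z]", raw_sizes)
    return name.strip(), price, sizes


# Hypothetical tile texts, written to resemble the output of get_text(strip=True).
print(parse_tile("Home Jersey 2025/26R$ 899,90SMLXL"))
print(parse_tile("Away JerseyFromR$ 1.299,90SMLXLXXL"))
print(parse_tile("Gift card"))
```

The lazy group `(.*?)` stops at the first "R$" (or "FromR$"), which is why the name comes out without the price attached. Texts that do not follow the pattern return `None` and can be stored unchanged, as the activity code does.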

In the data processing phase, the code replaces null values with 0, removes duplicates, parses the 'Date' column (stored as '%d-%m-%Y') into proper date values, converts 'Price (BRL)' to numeric values, and keeps only products priced above R$ 800. The price conversion was the difficult part: although the values looked numeric, they were stored as text, with '.' as the thousands separator and ',' as the decimal separator, so I had to convert them explicitly. After that, the clean and filtered data was saved to a new dataset.
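
The two conversions described above can be checked on single values with the standard library before applying them to a whole column. This is only a sketch: the helper names `parse_brl_price` and `to_iso_date` are hypothetical, and the sample values are made up.

```python
from datetime import datetime


def parse_brl_price(text):
    """Convert a price in Brazilian notation, such as '1.299,90', to a float."""
    # Removing the thousands separator, then turning the decimal comma into a dot.
    return float(text.replace(".", "").replace(",", "."))


def to_iso_date(text, source_format="%d-%m-%Y"):
    """Parse a date string in source_format and return it as 'yyyy-mm-dd'."""
    return datetime.strptime(text, source_format).date().isoformat()


print(parse_brl_price("1.299,90"))
print(to_iso_date("22-07-2025"))

# A format string that does not match the data raises an error here.
try:
    to_iso_date("22-07-2025", source_format="%d/%m/%Y")
except ValueError as error:
    print("Format mismatch:", error)
```

In Pandas, `pd.to_datetime(..., errors='coerce')` does not raise in that situation: values that do not match the format become `NaT`, so it is worth confirming that the format string uses the same separators as the stored dates.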

It is extremely important to ensure the quality and integrity of the data so that the analysis generates value for the company or person investing time in it. Poorly collected or processed data leads to unrealistic analyses, which defeats the main objective of data analysis: generating value.