# Exercises XP - W7D4

# Exercise 1: Identifying Data Types

Below are various data sources. Identify whether each one is an example of structured or unstructured data.

- A company's financial reports stored in an Excel file - structured data
- Photographs uploaded to a social media platform - unstructured data
- A collection of news articles on a website - unstructured data
- Inventory data in a relational database - structured data
- Recorded interviews from a market research study - unstructured data

# Exercise 2: Transformation Exercise

For each of the following unstructured data sources, propose a method to convert it into structured data. Explain your reasoning.

- A series of blog posts about travel experiences.

Processing: Organize the free text into a tabular format that allows for subsequent quantitative or categorical analysis.

Use text parsing or natural language processing (NLP) to extract the key information from each post: destinations visited, dates, opinions (positive or negative), and keywords.

Save the Post_ID, Destination, Date, Sentiment, and Keywords columns in a DataFrame.
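The extraction step can be sketched in a few lines. This is only an illustration: the lookup lists (`KNOWN_DESTINATIONS`, `POSITIVE_WORDS`, `NEGATIVE_WORDS`), the `structure_post` helper, and the sample posts are all invented here and stand in for a real NLP model.

```python
import re

# Hypothetical lookup lists standing in for a real NLP pipeline (assumption).
KNOWN_DESTINATIONS = ["Lisbon", "Kyoto", "Lima"]
POSITIVE_WORDS = {"loved", "amazing", "beautiful"}
NEGATIVE_WORDS = {"crowded", "dirty", "disappointing"}

def structure_post(post_id, text):
    """Turn one free-text blog post into a flat record (dict)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # First known destination mentioned in the post, if any.
    destination = next((d for d in KNOWN_DESTINATIONS if d.lower() in words), None)
    # Only ISO-style dates (YYYY-MM-DD) are recognized in this sketch.
    date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {
        "Post_ID": post_id,
        "Destination": destination,
        "Date": date_match.group(0) if date_match else None,
        "Sentiment": sentiment,
        "Keywords": sorted(words & (POSITIVE_WORDS | NEGATIVE_WORDS)),
    }

# Invented sample posts.
posts = [
    "2023-05-14: We loved Lisbon, the views were amazing.",
    "Kyoto was crowded and a bit disappointing on 2023-08-02.",
]
rows = [structure_post(i, text) for i, text in enumerate(posts, start=1)]
for row in rows:
    print(row)
```

Each dictionary in `rows` has the same keys, so the list can be passed directly to `pd.DataFrame(rows)` to get the table described above.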



- Audio recordings of customer service calls.

Processing: Convert the audio into tabular data that can be statistically analyzed.
1) Speech-to-text (automatic transcription) to convert the audio to text.
2) Apply NLP to the transcript to extract:
   - Query type (Billing, Technical, General)
   - Customer sentiment (Positive, Negative, Neutral)
3) Record the call duration, date, and operator (these usually come from the call metadata rather than from the transcript).

Save to table: Call_ID, Date, Duration, Operator, Category, Sentiment, Transcript.
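As a rough sketch of the categorization step, the snippet below assumes the speech-to-text stage has already produced transcripts. The keyword rules in `CATEGORY_KEYWORDS`, the `categorize` helper, and the sample calls are invented for illustration; a real system would more likely use a trained classifier. Sentiment is left out to keep the example short.

```python
# Hypothetical keyword rules (assumption); a trained classifier would replace them.
CATEGORY_KEYWORDS = {
    "Billing": ("invoice", "charge", "refund"),
    "Technical": ("error", "crash", "password"),
}

def categorize(transcript):
    """Return the first category whose keywords appear, else 'General'."""
    lowered = transcript.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return category
    return "General"

# Invented call metadata plus transcripts (Duration is in seconds).
calls = [
    {"Call_ID": 1, "Date": "2024-01-10", "Duration": 312, "Operator": "Dana",
     "Transcript": "I was charged twice and want a refund."},
    {"Call_ID": 2, "Date": "2024-01-10", "Duration": 145, "Operator": "Eli",
     "Transcript": "The app shows an error when I log in."},
    {"Call_ID": 3, "Date": "2024-01-11", "Duration": 60, "Operator": "Dana",
     "Transcript": "What are your opening hours?"},
]

# Add the derived Category column to every record.
for call in calls:
    call["Category"] = categorize(call["Transcript"])
    print(call["Call_ID"], call["Category"])
```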

- Handwritten notes from a brainstorming session.

Processing: Scan the handwritten notes and apply OCR (Optical Character Recognition) to extract digital text. This will allow the creation of a structured dataset.

Then, process the text to:

- Extract individual ideas
- Categorize by topic
- Identify responsible parties or actions (if mentioned)

Save in a table: Note_ID, Idea, Category, Owner/Action.
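A minimal sketch of the post-OCR step, assuming the OCR engine returns one idea per line and that owners were written as `(owner: Name)`. Both the sample text and that owner convention are assumptions made for this example, and topic categorization is left out.

```python
import re

# Text as it might come back from an OCR engine (invented sample).
ocr_text = """
- Launch a referral program (owner: Maya)
- Redesign the onboarding emails
* Survey churned customers (owner: Tom)
"""

# Hypothetical convention for marking who owns an idea (assumption).
OWNER_PATTERN = re.compile(r"\(owner:\s*([^)]+)\)", re.IGNORECASE)

def parse_notes(text):
    """Split OCR text into one record per idea, pulling out the owner if present."""
    rows = []
    for line in text.splitlines():
        idea = line.strip().lstrip("-*").strip()  # drop bullet markers
        if not idea:
            continue  # skip blank lines
        match = OWNER_PATTERN.search(idea)
        owner = match.group(1).strip() if match else None
        idea = OWNER_PATTERN.sub("", idea).strip()
        rows.append({"Note_ID": len(rows) + 1, "Idea": idea, "Owner": owner})
    return rows

notes = parse_notes(ocr_text)
for note in notes:
    print(note)
```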


- A video tutorial on cooking.

Processing: Organize the video content into structured steps to enable analysis, search, or automation (e.g., recipe apps).

* Extract audio from the video and apply speech-to-text to obtain instructions.

* Detect timestamps and use them to divide the transcript into steps.

* Extract ingredients and quantities using NLP.

* Save the following data in a table: Video_ID, Step_Number, Instruction, Ingredient, Quantity, Timestamp.
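The step and ingredient extraction could look roughly like this, assuming a timed transcript is already available. The `segments` data, the `INGREDIENT_PATTERN` regular expression, and the `to_steps` helper are illustrative assumptions; the pattern only handles a number, a unit, and a single-word ingredient.

```python
import re

# Invented (timestamp, sentence) pairs, as a timed transcript might provide.
segments = [
    ("00:15", "Add 200 g flour to the bowl."),
    ("00:42", "Pour in 2 cups milk and whisk."),
    ("01:10", "Bake until golden."),
]

# Hypothetical pattern: number, unit, then a single-word ingredient (assumption).
INGREDIENT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(g|kg|ml|cups?|tbsp|tsp)\b\s+(\w+)"
)

def to_steps(video_id, timed_segments):
    """Build one table row per transcript segment."""
    rows = []
    for step_number, (timestamp, instruction) in enumerate(timed_segments, start=1):
        match = INGREDIENT_PATTERN.search(instruction)
        rows.append({
            "Video_ID": video_id,
            "Step_Number": step_number,
            "Instruction": instruction,
            "Ingredient": match.group(3) if match else None,
            "Quantity": f"{match.group(1)} {match.group(2)}" if match else None,
            "Timestamp": timestamp,
        })
    return rows

steps = to_steps("video_001", segments)
for step in steps:
    print(step)
```

Steps without a recognizable quantity keep `None` in the Ingredient and Quantity columns, so every row still has the same shape.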

# Exercise 3: Import the train.csv file from GitHub

In [8]:
import pandas as pd
import zipfile    # To work with ZIP archives (compress/uncompress files)
import io         # To handle in-memory streams (like files, but in RAM)
import requests   # To download files or fetch data from the internet

# -----------------------------
# URL of the ZIP file on GitHub (raw)
# -----------------------------
url = "https://github.com/devtlv/Datasets-DA-Bootcamp-2-/raw/main/Week%204%20-%20Data%20Understanding/W4D3%20-%20Importing%20Data%2C%20Exporting%20D/train.zip"

# -----------------------------
# Download the ZIP file
# -----------------------------
response = requests.get(url, timeout=30)   # Download the ZIP file from GitHub and keep it in memory
response.raise_for_status()                # Raise an HTTPError instead of continuing if the download failed

# Wrap the downloaded bytes in a file-like object so zipfile can read them without saving to disk
z = zipfile.ZipFile(io.BytesIO(response.content))
print("Files in the ZIP:", z.namelist())

# -----------------------------
# Read the CSV file inside the ZIP
# -----------------------------
# Adjust 'train.csv' to the exact name that appears in z.namelist()
df = pd.read_csv(z.open('train.csv'))              # z.open() returns a file-like handle to the CSV inside the ZIP,
                                                   # so pandas can read it without extracting it to disk.

# -----------------------------
# Display the first few rows
# -----------------------------
print(df.head())


Files in the ZIP: ['train.csv']
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373
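
The in-memory ZIP technique used above can be tried without a network connection. The sketch below builds a tiny archive in RAM (the two-row CSV is invented) and reads one member with the standard library only. `read_csv_from_zip` is a helper name made up for this example.

```python
import csv
import io
import zipfile

# Build a small ZIP archive in memory so the example runs offline (invented data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "PassengerId,Survived\n1,0\n2,1\n")

def read_csv_from_zip(zip_bytes, member):
    """Return the rows of one CSV member of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {archive.namelist()}")
        with archive.open(member) as raw:
            # archive.open() yields bytes; wrap it to decode text for the csv module.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))

rows = read_csv_from_zip(buffer.getvalue(), "train.csv")
print(rows)
```

The handle returned by `archive.open(member)` is the same kind of object that is passed to `pd.read_csv` in the cell above. Note that `csv.DictReader` returns every value as a string, whereas pandas infers numeric types.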

# Exercise 5: Export a DataFrame to Excel and JSON formats



In [9]:
import pandas as pd
from google.colab import files  # To download the files

# Create a simple DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Member': [True, False, True]
})

# Export to Excel
df.to_excel('simple_dataframe.xlsx', index=False)

# Export to JSON
df.to_json('simple_dataframe.json', orient='records', date_format='iso')


# Download the files in Colab
files.download('simple_dataframe.xlsx')
files.download('simple_dataframe.json')

# Preview the DataFrame
print(df)


      Name  Age  Member
0    Alice   25    True
1      Bob   30   False
2  Charlie   22    True
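
With `orient='records'`, the JSON file is an array containing one object per row. The sketch below reproduces that layout for the same three rows using only the `json` module, and checks that the data survives a round trip. It illustrates the layout and is not a copy of the file pandas writes.

```python
import json

# The same three rows as the DataFrame above, written as a list of records.
records = [
    {"Name": "Alice", "Age": 25, "Member": True},
    {"Name": "Bob", "Age": 30, "Member": False},
    {"Name": "Charlie", "Age": 22, "Member": True},
]

# Compact separators give a one-line array of objects.
json_text = json.dumps(records, separators=(",", ":"))
print(json_text)

# Reading the text back should give exactly the same records.
restored = json.loads(json_text)
print(restored == records)
```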


# Exercise 6: Reading JSON Data

In [10]:
import pandas as pd
import zipfile    # To work with ZIP archives (compress/uncompress files)
import io         # To handle in-memory streams (like files, but in RAM)
import requests   # To download files or fetch data from the internet


# -----------------------------
# URL of the ZIP file containing the sample JSON dataset on GitHub (raw)
# -----------------------------
url = "https://github.com/devtlv/Datasets-DA-Bootcamp-2-/raw/main/Week%204%20-%20Data%20Understanding/W4D3%20-%20Importing%20Data%2C%20Exporting%20D/posts.zip"

# -----------------------------
# Download the ZIP file
# -----------------------------
response = requests.get(url, timeout=30)
response.raise_for_status()                # Raise an HTTPError instead of continuing if the download failed

z = zipfile.ZipFile(io.BytesIO(response.content))
print("Files in the ZIP:", z.namelist())   # Check the JSON file name

# -----------------------------
# Read the JSON file inside the ZIP into a DataFrame
# -----------------------------
# Adjust 'posts.json' to the exact name listed in z.namelist()
df = pd.read_json(z.open('posts.json'))

# -----------------------------
# Display the first 5 entries
# -----------------------------
print(df.head())


Files in the ZIP: ['posts.json']
   userId  id                                              title  \
0       1   1  sunt aut facere repellat provident occaecati e...   
1       1   2                                       qui est esse   
2       1   3  ea molestias quasi exercitationem repellat qui...   
3       1   4                               eum et est occaecati   
4       1   5                                 nesciunt quas odio   

                                                body  
0  quia et suscipit\nsuscipit recusandae consequu...  
1  est rerum tempore vitae\nsequi sint nihil repr...  
2  et iusto sed quo iure\nvoluptatem occaecati om...  
3  ullam et saepe reiciendis voluptatem adipisci\...  
4  repudiandae veniam quaerat sunt sed\nalias aut...  
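
Instead of hard-coding the member name, the JSON file can be picked from `namelist()`. The sketch below shows one way to do that offline with the standard library. The archive contents are invented, and `load_first_json` is a helper name made up for this example.

```python
import io
import json
import zipfile

# Build a small ZIP archive in memory so the example runs offline (invented data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("posts.json", json.dumps([
        {"userId": 1, "id": 1, "title": "first post"},
        {"userId": 1, "id": 2, "title": "second post"},
    ]))

def load_first_json(zip_bytes):
    """Return (member name, parsed data) for the first .json file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        json_names = [n for n in archive.namelist() if n.lower().endswith(".json")]
        if not json_names:
            raise FileNotFoundError("no .json member in the archive")
        with archive.open(json_names[0]) as handle:
            return json_names[0], json.load(handle)

name, posts = load_first_json(buffer.getvalue())
print(name, len(posts))
```

The list in `posts` is a list of records, so it could be turned into a table with `pd.DataFrame(posts)`.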
