### Exercise 1: Identifying Data Types

Below are various data sources. Identify whether each one is an example of structured or unstructured data.

- A company’s financial reports stored in an Excel file - **structured data**
- Photographs uploaded to a social media platform - **unstructured data**
- A collection of news articles on a website - **unstructured data**
- Inventory data in a relational database - **structured data**
- Recorded interviews from a market research study - **unstructured data**

### Exercise 2: Transformation Exercise

For each of the following unstructured data sources, propose a method to convert it into structured data. Explain your reasoning.

- **A series of blog posts about travel experiences.**<br>
  We can process this data using NLP and keyword extraction to identify categories such as countries, themes, emotions, and locations. After that, we can create a table with the extracted features.
- **Audio recordings of customer service calls.**<br>
  This data can be converted to structured data with **speech-to-text**.<br>
  First, we transcribe the audio into text and then define the parameters that matter for our analysis: who is speaking, the speaker’s emotion (using audio feature extraction), the main topic of the call, etc. After that, we create a table with these parameters and fill it with the relevant information extracted from the transcription.
- **Handwritten notes from a brainstorming session.**<br>
  For this data, we can use text recognition (OCR) to extract the handwritten text from images and then apply NLP to structure it. We can create a table with keywords, themes, and other important elements.
- **A video tutorial on cooking.**<br>
  We can apply video-to-text methods (for example, speech-to-text on the audio track) to extract spoken instructions and key steps. Then we can create a table with the list of ingredients, their proportions, and the order in which they are used, as well as a table describing cooking times.
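
The "extract features, then build a table" idea shared by the answers above can be sketched for the blog-post case. This is a minimal illustration, not a full NLP pipeline: the sample posts, the `COUNTRIES` and `THEMES` lookup lists, and the `extract_features` helper are all hypothetical, and plain keyword matching stands in for a real NLP model.

```python
import pandas as pd

# Hypothetical sample posts; real input would be the text of the blog posts.
posts = [
    "We spent a wonderful week hiking in Peru and loved the food.",
    "Our trip to Japan was stressful at first, but the temples were amazing.",
]

# Hypothetical lookup lists standing in for a real NLP model.
COUNTRIES = ["Peru", "Japan", "Italy"]
THEMES = ["hiking", "food", "temples", "beach"]


def extract_features(text: str) -> dict:
    """Return the countries and themes mentioned in one post."""
    lowered = text.lower()
    return {
        "countries": [c for c in COUNTRIES if c.lower() in lowered],
        "themes": [t for t in THEMES if t in lowered],
        "word_count": len(text.split()),
    }


# One row per post, one column per extracted feature.
travel_table = pd.DataFrame([extract_features(p) for p in posts])
print(travel_table)
```

Each unstructured post becomes one row with a fixed set of columns, which is what makes the result structured.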
 

### Exercise 3: Import a File from Kaggle

In [2]:
import pandas as pd

In [7]:
train_data = pd.read_csv(r"C:\Users\User\Desktop\train.csv") # Load the Kaggle train.csv file into a DataFrame

In [9]:
train_data.head() # Print the first few rows of the DataFrame

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |
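
A natural follow-up after `head()` is to check the shape and the missing values. The sketch below is only an illustration: it rebuilds the five rows shown above as an in-memory CSV (the `sample_csv` and `sample` names are assumptions) so that it runs without the original `train.csv`. With the real file, the same calls would be made on `train_data`.

```python
import io

import pandas as pd

# The five rows shown above, reproduced as an in-memory CSV so the sketch
# runs without the original train.csv file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
    '4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S\n'
    '5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S\n'
)
sample = pd.read_csv(sample_csv)

# Number of rows and columns that were loaded.
print(sample.shape)

# Count missing values per column and show only the columns that have any.
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

In this five-row sample, only the `Cabin` column has empty cells.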


### Exercise 4: Importing a CSV File

In [12]:
iris_data = pd.read_csv('Iris_dataset.csv', header=None) # Import the CSV file using Pandas; header=None because the file has no header row

In [13]:
iris_data.head() # Display the first five rows of the dataset

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [14]:
iris_data_wheaders = pd.read_csv('Iris_dataset.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']) # Import the CSV file using Pandas and assign column names

In [15]:
iris_data_wheaders.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
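
To see why `header=None` or `names=` matters for a file without a header row, the sketch below reads the same three data lines three ways. It uses an in-memory string (`raw`, an assumed stand-in for `Iris_dataset.csv`) so that it runs on its own.

```python
import io

import pandas as pd

# Three data lines with no header row, standing in for Iris_dataset.csv.
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Default behaviour: the first data line is consumed as the header.
default_read = pd.read_csv(io.StringIO(raw))

# header=None: all lines are kept as data, columns are numbered 0..4.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: all lines are kept as data, columns get readable names.
named = pd.read_csv(
    io.StringIO(raw),
    names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
)

print(len(default_read), list(default_read.columns))
print(len(no_header), list(no_header.columns))
print(len(named), list(named.columns))
```

With the default settings one observation is silently lost, because the first data line is used as the column labels.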


### Exercise 5: Export a DataFrame to Excel and JSON Formats

In [27]:
import pandas as pd
!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl

Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5


In [None]:
json_data = pd.read_json('https://jsonplaceholder.typicode.com/posts') # Use Pandas to read the JSON data.

In [30]:
json_data.to_excel('json_to_excel.xlsx', sheet_name='Sheet1', index=False) # Export the DataFrame to an Excel file without the index column
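
The exercise title also asks for a JSON export, which the cells above do not show. A minimal sketch of that step, assuming a small hypothetical DataFrame (`posts`) in place of `json_data` and a temporary file named `posts_export.json`, could look like this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row stand-in for json_data, so the sketch runs offline.
posts = pd.DataFrame(
    {
        "userId": [1, 1],
        "id": [1, 2],
        "title": ["qui est esse", "nesciunt quas odio"],
    }
)

# Write to a temporary folder; the file name is an assumption.
out_path = Path(tempfile.mkdtemp()) / "posts_export.json"

# orient="records" writes a list of {column: value} objects, one per row.
posts.to_json(out_path, orient="records", indent=2)

# Read the file back to confirm the export can be loaded again.
restored = pd.read_json(out_path)
print(restored)
```

`orient="records"` is chosen here because it matches the list-of-objects layout of the source API; other `orient` values produce different JSON shapes.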

### Exercise 6: Reading JSON Data

In [36]:
json_read_data = pd.read_json(r'C:\Users\User\Desktop\posts.json') # Use Pandas to read the JSON data.

In [32]:
json_read_data.head() # Display the first five entries of the data.

|   | userId | id | title | body |
|---|---|---|---|---|
| 0 | 1 | 1 | sunt aut facere repellat provident occaecati e... | quia et suscipit\nsuscipit recusandae consequu... |
| 1 | 1 | 2 | qui est esse | est rerum tempore vitae\nsequi sint nihil repr... |
| 2 | 1 | 3 | ea molestias quasi exercitationem repellat qui... | et iusto sed quo iure\nvoluptatem occaecati om... |
| 3 | 1 | 4 | eum et est occaecati | ullam et saepe reiciendis voluptatem adipisci\... |
| 4 | 1 | 5 | nesciunt quas odio | repudiandae veniam quaerat sunt sed\nalias aut... |
