# Read Damaged Data

Each file format stores data differently, so the first step is to load it into a pandas DataFrame: a structured, table-like object that is easy to filter, clean, and analyze. When a file is damaged (missing fields, extra fields, or invalid syntax), the standard reader functions fail and the file has to be parsed record by record. This notebook walks through both cases for CSV, JSON, and plain-text files.

In [10]:
import os
os.listdir()

['us_tourism2damage.csv',
 'damaged_students.csv',
 'damaged.json',
 'us_tourism1.json',
 'Orchlon_Quiz33.ipynb',
 'us_tourism.csv',
 'unstructureddata.txt',
 'us_tourism.json',
 'damaged_data.txt']

In [11]:
import glob
csvfiles = glob.glob("*.csv")
csvfiles

['us_tourism2damage.csv', 'damaged_students.csv', 'us_tourism.csv']

## 1.1 Read CSV data

In [12]:
import pandas as pd

file_path = 'us_tourism.csv'

# Read the CSV file
dataframe_csv1 = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
dataframe_csv1.head(5)

Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,Clam Chowder,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,Deep-Dish Pizza,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,Key Lime Pie,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival


## 1.2 Handle damaged CSV data

In [13]:
file_path2 = 'us_tourism2damage.csv'

# Try to read the damaged CSV file (this raises a ParserError)
df = pd.read_csv(file_path2)

# Display the DataFrame
df

ParserError: Error tokenizing data. C error: Expected 8 fields in line 48, saw 9


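### pandas cannot read the damaged file

Line 48 has 9 fields instead of the expected 8, so `pd.read_csv` raises a `ParserError`. We need to read the file line by line and separate the well-formed rows from the damaged ones.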
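The `ParserError` names only the first offending line. Before filtering, it can help to list every row whose field count differs from the header. The following is a minimal sketch using the standard `csv` module; `find_bad_lines` and the sample text are hypothetical and not part of the notebook's data.

```python
import csv
import io

def find_bad_lines(text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    expected = len(header)
    # reader.line_num is the 1-based number of the last line read from the source
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]

# Made-up sample: the third line has one field too many
sample = (
    "Region,State,Attraction\n"
    "West,Nevada,Las Vegas Strip\n"
    "West,Hawaii,Volcanoes,extra\n"
)
print(find_bad_lines(sample))
```

For a real file, the same function could be called with the file's contents, for example `find_bad_lines(open(path, encoding='utf-8').read())`.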

In [15]:
import csv

# Path to the damaged CSV file
file_path = 'us_tourism2damage.csv'

# Read line by line and separate well-formed rows from damaged ones
title = []
gooddata = []
baddata = []

with open(file_path, mode='r', encoding='utf-8', newline='') as file:
    csv_reader = csv.reader(file)
    title = next(csv_reader)  # the first row is the header
    for row in csv_reader:
        # A well-formed row has as many fields as the header (8 here)
        if len(row) == len(title):
            gooddata.append(row)
        else:
            baddata.append(row)

In [16]:
gooddata

[['Northeast',
  'New York',
  'Statue of Liberty',
  'Iconic symbol of freedom',
  'Bagels',
  'Boiled and baked dough rings',
  'Year-round',
  'Fourth of July Celebration'],
 ['Northeast',
  'Massachusetts',
  'Freedom Trail',
  'Historic 2.5-mile trail',
  'Clam Chowder',
  'Creamy soup with clams and potatoes',
  'Fall',
  'Boston Marathon'],
 ['Midwest',
  'Illinois',
  'Willis Tower',
  'One of the tallest buildings in the US',
  'Deep-Dish Pizza',
  'Thick crust pizza with layers of toppings',
  'Summer',
  'Taste of Chicago'],
 ['Midwest',
  'Michigan',
  'The Great Lakes',
  'Largest group of freshwater lakes',
  'Pasties',
  'Pastry filled with meat and vegetables',
  'Winter',
  'Great Lakes Shipwreck Festival'],
 ['South',
  'Florida',
  'Disney World',
  'Famous theme park and resort',
  'Key Lime Pie',
  'Sweet pie made with Key lime juice',
  'Summer',
  "Disney's Halloween Festival"],
 ['South',
  'Louisiana',
  'French Quarter',
  'Historic New Orleans neighborhood',


In [17]:
gooddf = pd.DataFrame(gooddata,columns = title)
gooddf

Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,Clam Chowder,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,Deep-Dish Pizza,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,Key Lime Pie,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival
5,South,Louisiana,French Quarter,Historic New Orleans neighborhood,Jambalaya,Rice dish with meat and vegetables,Spring,Mardi Gras
6,West,California,Golden Gate Bridge,Iconic suspension bridge,Sourdough Bread,San Francisco's famous crusty bread,Summer,San Francisco Jazz Festival
7,West,Hawaii,Hawaii Volcanoes National Park,,Poke,Raw fish salad,Year-round,Aloha Festivals
8,Southwest,Texas,The Alamo,Historic mission and fortress,Tex-Mex Cuisine,Fusion of Mexican and American cuisines,Spring,San Antonio Fiesta
9,Southwest,Arizona,Grand Canyon,One of the world's natural wonders,Chimichangas,Deep-fried burritos,Fall,Grand Canyon Music Festival


## 2.1 Read JSON data

In [18]:
jsonfiles = glob.glob("*.json")
jsonfiles

['damaged.json', 'us_tourism1.json', 'us_tourism.json']

In [19]:
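# Read the JSON Lines file (one JSON object per line).
# The file is named explicitly because glob does not guarantee any ordering.
df = pd.read_json('us_tourism.json', lines=True)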

# Displaying the DataFrame
df.head(5)

Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,Clam Chowder,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,Deep-Dish Pizza,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,Key Lime Pie,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival


## 2.2 Handle damaged JSON data

In [20]:
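# Try to read the damaged JSON Lines file (this raises a ValueError).
# The file is named explicitly because glob does not guarantee any ordering.
dfDamage = pd.read_json('damaged.json', lines=True)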

# Displaying the DataFrame
dfDamage.head(5)



ValueError: Expected object or value

### 2.2.1 Simple demo

In [21]:
import json

# Example JSON data as a list of strings (each string is a JSON object)
json_data = [
    '{"name": "Alice", "age": 25, "city": "New York"}',
    '{"name": "Bob", "age": 30, "city": "San Francisco"}',
    '{"name": "Charlie", "age": 35}',
    '{"name": "Bib", "age": , "city": "Seattle"}',
    '{"name": "Bcob", "city": "Beijing"}',
    '{"name": "Bob", "xxxx": 30, "city": "San Francisco"}'
]

In [22]:
data = []
for json_str in json_data:
    try:
        # Attempt to parse each line as JSON
        json_obj = json.loads(json_str)

        # Fill in any expected key that is missing (extra keys are kept as-is)
        for key in ('name', 'age', 'city'):
            json_obj.setdefault(key, None)

        # Append the parsed JSON object to the list
        data.append(json_obj)

    except json.JSONDecodeError:
        # Syntactically broken strings are reported and skipped
        print('Invalid Json data')

# Convert the list of dictionaries to a pandas DataFrame
dfjson = pd.DataFrame(data)

# Show only the expected columns
dfjson[['name', 'age', 'city']]

Invalid Json data


Unnamed: 0,name,age,city
0,Alice,25.0,New York
1,Bob,30.0,San Francisco
2,Charlie,35.0,
3,Bcob,,Beijing
4,Bob,,San Francisco
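The demo prints only "Invalid Json data", which does not say what is wrong with the string. `json.JSONDecodeError` carries the attributes `msg`, `pos`, `lineno`, and `colno`, so the message can point at the damage. The sketch below shows one possible way to use them; `try_parse` is a hypothetical helper, not part of the notebook.

```python
import json

def try_parse(line):
    """Return (obj, None) on success, or (None, description) if the line is not valid JSON."""
    try:
        return json.loads(line), None
    except json.JSONDecodeError as err:
        # msg and pos are standard attributes of JSONDecodeError
        return None, f"{err.msg} at position {err.pos}"

# The damaged string from the demo: "age" has no value
obj, problem = try_parse('{"name": "Bib", "age": , "city": "Seattle"}')
print(obj, problem)
```

Collecting these descriptions in a list, instead of printing them, would make it possible to review the rejected lines later.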


### 2.2.2 Handle the damaged JSON file

In [23]:
file_path = 'damaged.json'
data = []
title = ['Region', 'State', 'Attraction', 'Description',
         'Food Specialty', 'Food Description', 'Best Time to Visit', 'Notable Events']

with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            # Parse one line as a JSON object
            json_obj = json.loads(line)

            # Fill in any expected key that is missing
            for key in title:
                json_obj.setdefault(key, None)

            # Append the processed JSON object to the list
            data.append(json_obj)
        except json.JSONDecodeError:
            # Syntactically broken lines are reported and skipped
            print(f'Invalid Json data: {line}')
            

# Convert the list of dictionaries to a pandas DataFrame (optional)
df = pd.DataFrame(data)

# Display the DataFrame
df

Invalid Json data: {"Region":"Southwest","State":"Texas","Attraction":"The Alamo","Description":"Historic mission and fortress","Food Specialty":,"Food Description":"Fusion of Mexican and American cuisines","Best Time to Visit":"Spring","Notable Events":"San Antonio Fiesta"}

Invalid Json data: {"Region":"South","State":"North Carolina","Attraction":"Great Smoky Mountains National Park","Description":,"Food Specialty":"Pulled Pork BBQ","Food Description":"Slow-cooked barbecue pork","Best Time to Visit":"Summer","Notable Events":"Mountain Music Festivals"}

Invalid Json data: {"Region":"West","State":"Nevada","Attraction":"Las Vegas Strip",,"Food Specialty":"Buffet","Food Description":"All-you-can-eat buffets with various cuisines","Best Time to Visit":"Year-round","Notable Events":"Las Vegas New Year's Eve"}

Invalid Json data: {"Region":"West","State":"Colorado","Attraction":"Rocky Mountain National Park",,"Food Specialty":"Rocky Mountain Oysters","Food Description":"Deep-fried bull t

Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,Clam Chowder,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,Deep-Dish Pizza,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,Key Lime Pie,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival
5,South,Louisiana,French Quarter,Historic New Orleans neighborhood,Jambalaya,Rice dish with meat and vegetables,Spring,Mardi Gras
6,West,California,Golden Gate Bridge,Iconic suspension bridge,Sourdough Bread,San Francisco's famous crusty bread,Summer,San Francisco Jazz Festival
7,West,Hawaii,Hawaii Volcanoes National Park,Home to active volcanoes,Poke,Raw fish salad,Year-round,Aloha Festivals
8,Southwest,Arizona,Grand Canyon,One of the world's natural wonders,Chimichangas,Deep-fried burritos,Fall,Grand Canyon Music Festival
9,Pacific Northwest,Washington,Space Needle,Futuristic tower with an observation deck,Salmon,Fresh and smoked fish,Summer,Seattle International Film Festival


## 3.1 Read TXT data

Example of `split`:
```python
text.split(" ")
```

In [24]:
h = 'Region: Northeast State: New York'
result_h = h.split(" ")
result_h

['Region:', 'Northeast', 'State:', 'New', 'York']
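Splitting on spaces breaks multi-word values such as "New York", so the records below are split on the colon instead. A plain `split(":")` fails when the value itself contains a colon. One possible alternative is `str.partition`, which splits on the first separator only. This is a minimal sketch; `parse_field` is a hypothetical helper, and the second sample string is made up to show a value that contains a colon.

```python
def parse_field(line):
    """Split 'Key: value' on the first colon only and trim both sides."""
    key, sep, value = line.partition(":")
    if not sep:
        return None  # no colon, so this line is not a key/value pair
    return key.strip(), value.strip()

print(parse_field("Region: Northeast"))
print(parse_field("Description: Open daily: 9am to 5pm"))
print(parse_field("no separator here"))
```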

In [25]:
file_path = 'unstructureddata.txt'

# Turn a block of "Key: value" lines into a dictionary
def process_block(block):
    entity = {}
    for item in block:
        # Split on the first colon only, so colons inside the value are kept
        key, value = item.split(":", 1)
        entity[key] = value
    return entity

# Read the file; records are separated by blank lines
current_data = []
result = []
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        if line.strip() == "":
            if current_data:  # ignore repeated blank lines
                result.append(process_block(current_data))
                current_data = []
        else:
            current_data.append(line.strip())

# Flush the last record if the file does not end with a blank line
if current_data:
    result.append(process_block(current_data))

            
print(result[:3])

[{'Region': ' Northeast', 'State': ' New York', 'Attraction': ' Statue of Liberty', 'Description': ' Iconic symbol of freedom', 'Food Specialty': ' Bagels', 'Food Description': ' Boiled and baked dough rings', 'Best Time to Visit': ' Year-round', 'Notable Events': ' Fourth of July Celebration'}, {'Region': ' Northeast', 'State': ' Massachusetts', 'Attraction': ' Freedom Trail', 'Description': ' Historic 2.5-mile trail', 'Food Specialty': ' Clam Chowder', 'Food Description': ' Creamy soup with clams and potatoes', 'Best Time to Visit': ' Fall', 'Notable Events': ' Boston Marathon'}, {'Region': ' Midwest', 'State': ' Illinois', 'Attraction': ' Willis Tower', 'Description': ' One of the tallest buildings in the US', 'Food Specialty': ' Deep-Dish Pizza', 'Food Description': ' Thick crust pizza with layers of toppings', 'Best Time to Visit': ' Summer', 'Notable Events': ' Taste of Chicago'}]


In [26]:
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(result)
# Display the DataFrame
df

Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,Clam Chowder,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,Deep-Dish Pizza,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,Key Lime Pie,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival
5,South,Louisiana,French Quarter,Historic New Orleans neighborhood,Jambalaya,Rice dish with meat and vegetables,Spring,Mardi Gras
6,West,California,Golden Gate Bridge,Iconic suspension bridge,Sourdough Bread,San Francisco's famous crusty bread,Summer,San Francisco Jazz Festival
7,West,Hawaii,Hawaii Volcanoes National Park,Home to active volcanoes,Poke,Raw fish salad,Year-round,Aloha Festivals
8,Southwest,Texas,The Alamo,Historic mission and fortress,Tex-Mex Cuisine,Fusion of Mexican and American cuisines,Spring,San Antonio Fiesta
9,Southwest,Arizona,Grand Canyon,One of the world's natural wonders,Chimichangas,Deep-fried burritos,Fall,Grand Canyon Music Festival


## 3.2 Handle damaged TXT data

In [27]:
import pandas as pd

In [28]:
"""Your turn -- Class practice"""
file_path = 'damaged_data.txt'  # use your file path

# Define the expected keys
expected_keys = ["Region", "State", "Attraction", "Description", 
                 "Food Specialty", "Food Description", 
                 "Best Time to Visit", "Notable Events"]


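# Turn a block of "Key: value" lines into a dictionary, tolerating damage
def process_block_with_damaged(block):
    # Start with every expected key set to None so missing fields stay empty
    entity = {key: None for key in expected_keys}
    for item in block:
        if ": " in item:
            # Split on the first ": " only, so the value may contain colons
            key, value = item.split(": ", 1)
            if key in entity:
                entity[key] = value
    return entity

# Read the file; records are separated by blank lines
current_data = []
result = []
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        if line.strip() == "":
            if current_data:  # ignore repeated blank lines
                result.append(process_block_with_damaged(current_data))
                current_data = []
        else:
            current_data.append(line.strip())

# Flush the last record if the file does not end with a blank line
if current_data:
    result.append(process_block_with_damaged(current_data))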

# Convert the list of dictionaries into a DataFrame
dfnew = pd.DataFrame(result)

# Display the DataFrame
dfnew


Unnamed: 0,Region,State,Attraction,Description,Food Specialty,Food Description,Best Time to Visit,Notable Events
0,Northeast,New York,Statue of Liberty,Iconic symbol of freedom,Bagels,Boiled and baked dough rings,Year-round,Fourth of July Celebration
1,Northeast,Massachusetts,Freedom Trail,Historic 2.5-mile trail,,Creamy soup with clams and potatoes,Fall,Boston Marathon
2,Midwest,Illinois,Willis Tower,One of the tallest buildings in the US,,Thick crust pizza with layers of toppings,Summer,Taste of Chicago
3,Midwest,Michigan,The Great Lakes,Largest group of freshwater lakes,Pasties,Pastry filled with meat and vegetables,Winter,Great Lakes Shipwreck Festival
4,South,Florida,Disney World,Famous theme park and resort,,Sweet pie made with Key lime juice,Summer,Disney's Halloween Festival
5,South,Louisiana,French Quarter,Historic New Orleans neighborhood,Jambalaya,Rice dish with meat and vegetables,Spring,Mardi Gras
6,West,California,Golden Gate Bridge,Iconic suspension bridge,Sourdough Bread,San Francisco's famous crusty bread,Summer,San Francisco Jazz Festival
7,West,Hawaii,Hawaii Volcanoes National Park,Home to active volcanoes,Poke,Raw fish salad,Year-round,Aloha Festivals
8,Southwest,Texas,The Alamo,Historic mission and fortress,Tex-Mex Cuisine,Fusion of Mexican and American cuisines,Spring,San Antonio Fiesta
9,Southwest,Arizona,Grand Canyon,One of the world's natural wonders,Chimichangas,Deep-fried burritos,Fall,Grand Canyon Music Festival


### Practice 🧪 Lab: Cleaning “Damaged” Data with pandas

In this lab, you’ll practice detecting and fixing common data problems:

- Rows with **missing fields**
- Rows with **extra fields** (e.g., too many columns due to extra commas)
- **Wrong types** (strings in numeric columns)
- **Out-of-range values** (e.g., impossible GPA)
- Empty strings vs true missing values (`NaN`)

You will:  
1) Load a deliberately “damaged” CSV,  
2) Identify & filter bad rows,  
3) Coerce types, and  
4) Produce a clean DataFrame ready for analysis.


In [32]:
# 🚧 Create a deliberately “damaged” CSV (run this cell once)
source_csv = """Student_ID,Name,Major,Credits,GPA,Grad_Year
1001,Alice,Computer Science,120,3.85,2026
1002,Bob,Math, ,3.10,2025
1003,Charlie,Physics,98,3O,2024
1004,Diana,Economics,140,4.7,2023
1005,Ed,Computer Science,129,N/A,2027
1006, ,Biology,88,2.95,2028
1007,Grace,Physics,101,3.40,2029,EXTRA_TOKEN
,Henry,Math,64,2.70,2026
1009,Ivy,,110,3.20,2025
1010,Jack,Economics,215,3.50,2035
1011,Kate,Chemistry,95,3.00,
1012,Liam,Computer Science,NaN,3.60,2024
1013,Mia,Math,77,,2025
1014,Noah,Physics,85,3.25,2027
1015,Olivia,Economics,,2.80,2026
"""



In [33]:
output_path = "damaged_students.csv"
with open(output_path, "w", encoding="utf-8") as f:
    f.write(source_csv)
print(f"Wrote damaged CSV to: {output_path}")

Wrote damaged CSV to: damaged_students.csv


In [34]:
import pandas as pd
import csv

file_path = 'damaged_students.csv'

title = []
good_data = []
bad_data = []

with open(file_path, mode='r', encoding='utf-8', newline='') as file:
    csv_reader = csv.reader(file)
    title = next(csv_reader)  # the first row is the header
    for row in csv_reader:
        # Keep rows whose field count matches the header (6 here)
        if len(row) == len(title):
            good_data.append(row)
        else:
            bad_data.append(row)


In [35]:
gooddf = pd.DataFrame(good_data,columns = title)
gooddf

Unnamed: 0,Student_ID,Name,Major,Credits,GPA,Grad_Year
0,1001.0,Alice,Computer Science,120.0,3.85,2026.0
1,1002.0,Bob,Math,,3.10,2025.0
2,1003.0,Charlie,Physics,98.0,3O,2024.0
3,1004.0,Diana,Economics,140.0,4.7,2023.0
4,1005.0,Ed,Computer Science,129.0,,2027.0
5,1006.0,,Biology,88.0,2.95,2028.0
6,,Henry,Math,64.0,2.70,2026.0
7,1009.0,Ivy,,110.0,3.20,2025.0
8,1010.0,Jack,Economics,215.0,3.50,2035.0
9,1011.0,Kate,Chemistry,95.0,3.00,
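Filtering by field count removes only the row with the extra token. Steps 3 and 4 of the lab (coercing types and producing a clean table) could look roughly like the sketch below. The five rows are copied from the damaged CSV above. Treating 0.0 to 4.0 as the valid GPA range and dropping rows with an unusable `Credits` or `GPA` are assumptions, not rules stated in the lab.

```python
import pandas as pd

# A few rows from damaged_students.csv, as strings (the way csv.reader returns them)
columns = ["Student_ID", "Name", "Major", "Credits", "GPA", "Grad_Year"]
rows = [
    ["1001", "Alice", "Computer Science", "120", "3.85", "2026"],
    ["1002", "Bob", "Math", " ", "3.10", "2025"],
    ["1003", "Charlie", "Physics", "98", "3O", "2024"],
    ["1004", "Diana", "Economics", "140", "4.7", "2023"],
    ["1005", "Ed", "Computer Science", "129", "N/A", "2027"],
]
students = pd.DataFrame(rows, columns=columns)

# Step 3: coerce numeric columns; blanks and unparseable text become NaN
for col in ["Student_ID", "Credits", "GPA", "Grad_Year"]:
    students[col] = pd.to_numeric(students[col].str.strip(), errors="coerce")

# Out-of-range check (assumed scale: 0.0 to 4.0); impossible GPAs become NaN
students.loc[~students["GPA"].between(0.0, 4.0), "GPA"] = float("nan")

# Step 4: keep only rows whose Credits and GPA survived the checks
clean_students = students.dropna(subset=["Credits", "GPA"]).reset_index(drop=True)
print(clean_students)
```

Whether to drop a damaged row or keep it with `NaN` in place depends on the analysis. Dropping is used here only to keep the sketch short.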
