# Types of Data

### 1. **Structured Data**
   - **Relational Database Data**: 
     - **Example**: A SQL database containing tables like `Customers`, `Orders`, and `Products`.
     - **Columns**: `CustomerID`, `OrderID`, `ProductID`, `OrderDate`, `Quantity`, `Price`.
   - **CSV Files**:
     - **Example**: A CSV file with employee records.
     - **Columns**: `EmployeeID`, `Name`, `Department`, `Salary`, `HireDate`.
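
Because every row shares the same columns, structured data can be read straight into records and aggregated. The sketch below is a minimal illustration using only the standard library; the three sample rows are hypothetical, and only the column names come from the employee example above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; only the header matches the columns listed above.
CSV_TEXT = """EmployeeID,Name,Department,Salary,HireDate
1,Alice,Engineering,95000,2020-03-01
2,Bob,Engineering,85000,2021-07-15
3,Carol,Marketing,70000,2019-11-30
"""


def average_salary_by_department(csv_text):
    """Return {department: mean salary} for CSV text with a header row."""
    totals = defaultdict(lambda: [0.0, 0])  # department -> [salary sum, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["Department"]]
        entry[0] += float(row["Salary"])  # CSV values always arrive as strings
        entry[1] += 1
    return {dept: total / count for dept, (total, count) in totals.items()}


averages = average_salary_by_department(CSV_TEXT)
print(averages)
```

For a real file, `io.StringIO(csv_text)` would be replaced by an open file handle.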

### 2. **Semi-Structured Data**
   - **JSON Data**:
     - **Example**: A JSON file containing user profiles.
     - **Structure**:
       ```json
       {
         "user_id": "123",
         "name": "John Doe",
         "preferences": {
           "color": "blue",
           "food": "pizza"
         },
         "purchase_history": [
           {"product_id": "001", "date": "2023-01-01", "amount": 25.5},
           {"product_id": "002", "date": "2023-02-14", "amount": 15.0}
         ]
       }
       ```
   - **XML Data**:
     - **Example**: An XML file containing book information.
     - **Structure**:
       ```xml
       <book>
         <title>Data Engineering 101</title>
         <author>Jane Smith</author>
         <published>2021-05-10</published>
         <price>39.99</price>
       </book>
       ```
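
Both formats above can be parsed with Python's standard library. The following sketch embeds the two sample documents as strings and shows one possible way to read nested JSON values and XML child elements; the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

# The two sample documents from the examples above, embedded as strings.
USER_JSON = """
{
  "user_id": "123",
  "name": "John Doe",
  "preferences": {"color": "blue", "food": "pizza"},
  "purchase_history": [
    {"product_id": "001", "date": "2023-01-01", "amount": 25.5},
    {"product_id": "002", "date": "2023-02-14", "amount": 15.0}
  ]
}
"""

BOOK_XML = """
<book>
  <title>Data Engineering 101</title>
  <author>Jane Smith</author>
  <published>2021-05-10</published>
  <price>39.99</price>
</book>
"""

# JSON: nested objects become dicts, arrays become lists.
user = json.loads(USER_JSON)
total_spent = sum(purchase["amount"] for purchase in user["purchase_history"])

# XML: iterate over child elements; element text is always a string,
# so numeric fields need an explicit conversion.
book = ET.fromstring(BOOK_XML.strip())
book_record = {child.tag: child.text for child in book}
book_record["price"] = float(book_record["price"])

print(user["preferences"]["color"], total_spent)
print(book_record)
```

Note that JSON keeps the number type of `amount`, while XML carries everything as text and leaves typing to the reader.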

### 3. **Unstructured Data**
   - **Text Files**:
     - **Example**: A collection of text files with raw customer feedback.
     - **Content**: 
       ```
       "The product is great but the delivery was slow."
       "I love the new features in the latest update!"
       ```
   - **Log Files**:
     - **Example**: Server logs capturing user activities.
     - **Content**:
       ```
       2023-08-13 10:22:34, UserID: 123, Action: Login
       2023-08-13 10:23:01, UserID: 123, Action: Viewed Product, ProductID: 456
       ```
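
Log lines like the ones above often follow a loose pattern that can be turned into records. This is a rough sketch, assuming every line starts with a timestamp and that the remaining fields are `Key: Value` pairs separated by a comma and a space; a value that itself contains `", "` would break this simple split.

```python
from datetime import datetime

LOG_LINES = [
    "2023-08-13 10:22:34, UserID: 123, Action: Login",
    "2023-08-13 10:23:01, UserID: 123, Action: Viewed Product, ProductID: 456",
]


def parse_log_line(line):
    """Split one log line into a timestamp plus its 'Key: Value' fields."""
    timestamp_text, *field_texts = line.split(", ")
    record = {"timestamp": datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")}
    for field in field_texts:
        key, value = field.split(": ", 1)  # split only on the first separator
        record[key] = value
    return record


events = [parse_log_line(line) for line in LOG_LINES]
for event in events:
    print(event)
```

Different lines can yield different keys (only the second has `ProductID`), which is why such data does not fit a fixed table without extra handling.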

### 4. **Time-Series Data**
   - **Sensor Data**:
     - **Example**: Data from IoT sensors monitoring temperature and humidity.
     - **Columns**: `Timestamp`, `SensorID`, `Temperature`, `Humidity`.
     - **Sample**:
       ```
       2023-08-13 10:00:00, Sensor_01, 25.4, 60
       2023-08-13 10:05:00, Sensor_01, 25.7, 58
       ```
   - **Financial Data**:
     - **Example**: Stock price data.
     - **Columns**: `Timestamp`, `StockSymbol`, `OpenPrice`, `ClosePrice`, `Volume`.
     - **Sample**:
       ```
       2023-08-13 09:30:00, AAPL, 145.6, 147.3, 50000
       2023-08-13 09:45:00, AAPL, 147.3, 148.0, 62000
       ```
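
What sets time-series data apart is that the order and spacing of timestamps carry meaning. The sketch below parses the two sensor rows shown above and computes the gap and the temperature change between consecutive readings; the function and variable names are illustrative.

```python
from datetime import datetime

SENSOR_ROWS = [
    "2023-08-13 10:00:00, Sensor_01, 25.4, 60",
    "2023-08-13 10:05:00, Sensor_01, 25.7, 58",
]


def parse_reading(row):
    """Turn one comma-separated row into a typed record."""
    timestamp, sensor_id, temperature, humidity = (part.strip() for part in row.split(","))
    return {
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "sensor_id": sensor_id,
        "temperature": float(temperature),
        "humidity": float(humidity),
    }


# Sort by time so that consecutive records are also consecutive in time.
readings = sorted((parse_reading(row) for row in SENSOR_ROWS), key=lambda r: r["timestamp"])

intervals_seconds = []
temperature_changes = []
for earlier, later in zip(readings, readings[1:]):
    intervals_seconds.append((later["timestamp"] - earlier["timestamp"]).total_seconds())
    temperature_changes.append(round(later["temperature"] - earlier["temperature"], 2))

print(intervals_seconds, temperature_changes)
```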

### 5. **Graph Data**
   - **Social Network Data**:
     - **Example**: A graph dataset representing a social network.
     - **Nodes**: Users.
     - **Edges**: Friend relationships.
     - **Structure**:
       ```json
       {
         "nodes": [
           {"user_id": "123", "name": "Alice"},
           {"user_id": "124", "name": "Bob"}
         ],
         "edges": [
           {"from": "123", "to": "124", "type": "friend"}
         ]
       }
       ```
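
A node and edge list like the one above is usually converted into an adjacency structure before it is queried. The sketch below is one possible approach, assuming a `friend` edge is mutual and should therefore be stored in both directions; `friends_of` is a hypothetical helper.

```python
import json
from collections import defaultdict

GRAPH_JSON = """
{
  "nodes": [
    {"user_id": "123", "name": "Alice"},
    {"user_id": "124", "name": "Bob"}
  ],
  "edges": [
    {"from": "123", "to": "124", "type": "friend"}
  ]
}
"""

graph = json.loads(GRAPH_JSON)

# Look up display names by node id.
names = {node["user_id"]: node["name"] for node in graph["nodes"]}

# Adjacency sets: assumption that friendship is symmetric, so add both directions.
adjacency = defaultdict(set)
for edge in graph["edges"]:
    adjacency[edge["from"]].add(edge["to"])
    adjacency[edge["to"]].add(edge["from"])


def friends_of(user_id):
    """Return the sorted names of a user's direct friends."""
    return sorted(names[friend_id] for friend_id in adjacency.get(user_id, ()))


print(friends_of("123"))
print(friends_of("124"))
```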

### 6. **Geospatial Data**
   - **Location Data**:
     - **Example**: GPS coordinates of delivery trucks.
     - **Columns**: `TruckID`, `Latitude`, `Longitude`, `Timestamp`.
     - **Sample**:
       ```
       TRK_001, 40.7128, -74.0060, 2023-08-13 08:00:00
       TRK_002, 34.0522, -118.2437, 2023-08-13 08:05:00
       ```
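
A typical operation on latitude and longitude pairs is computing the distance between two points. The sketch below applies the haversine formula to the two sample coordinates above. It treats the Earth as a sphere with a mean radius of 6371 km, so the result is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Positions of TRK_001 and TRK_002 from the sample rows above.
distance = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
print(f"{distance:.0f} km")
```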

### 7. **Image Data**
   - **Image Files**:
     - **Example**: A dataset of labeled images for computer vision tasks.
     - **Structure**: 
       - `Image`: A file representing the image.
       - `Label`: A category label for the image (e.g., "cat", "dog").
     - **File**: `image_001.jpg` with label `cat`.

### 8. **Audio Data**
   - **Audio Files**:
     - **Example**: A collection of audio recordings for speech recognition.
     - **Files**: `audio_001.wav`, `audio_002.wav`.
     - **Metadata**: 
       - `Duration`: Length of the audio.
       - `Transcript`: Text transcription of the speech.
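
Metadata such as `Duration` can often be derived from the file itself. As a minimal sketch, the standard `wave` module reports the frame count and sample rate of a WAV file, and their ratio is the duration in seconds. To stay self-contained, the example writes a short silent clip to memory instead of reading `audio_001.wav`; the clip parameters are arbitrary.

```python
import io
import wave


def wav_duration_seconds(wav_file):
    """Return the duration of a WAV file (path or file object) in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Build a 2-second silent mono clip in memory so the sketch needs no real file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)                   # 16-bit samples
    wav.setframerate(8000)                # 8 kHz sample rate
    wav.writeframes(b"\x00\x00" * 16000)  # 16000 frames = 2 s at 8 kHz
buffer.seek(0)

duration = wav_duration_seconds(buffer)
print(duration)
```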

### 9. **Video Data**
   - **Video Files**:
     - **Example**: Security camera footage.
     - **Files**: `video_001.mp4`.
     - **Metadata**: 
       - `Duration`: Length of the video.
       - `Timestamp`: When the video was recorded.

### 10. **Streaming Data**
   - **Real-Time Data Streams**:
     - **Example**: Real-time tweets from the Twitter API.
     - **Structure**: JSON objects with fields like `tweet_id`, `user_id`, `timestamp`, `text`.
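
Streaming data is processed one record at a time as it arrives, rather than loaded all at once. The sketch below simulates this with a generator; it does not call any real API, and the sample messages are hypothetical (only the field names come from the structure above).

```python
import json
from collections import Counter


def simulated_stream():
    """Yield JSON strings one at a time, standing in for a live feed (hypothetical data)."""
    yield '{"tweet_id": "1", "user_id": "u1", "timestamp": "2023-08-13T10:00:00", "text": "hello"}'
    yield '{"tweet_id": "2", "user_id": "u2", "timestamp": "2023-08-13T10:00:05", "text": "hi"}'
    yield '{"tweet_id": "3", "user_id": "u1", "timestamp": "2023-08-13T10:00:09", "text": "again"}'


def count_by_user(messages):
    """Consume messages incrementally and yield a running tweet count per user."""
    counts = Counter()
    for raw in messages:
        event = json.loads(raw)
        counts[event["user_id"]] += 1
        yield dict(counts)  # snapshot of the state after this event


snapshots = list(count_by_user(simulated_stream()))
for snapshot in snapshots:
    print(snapshot)
```

The running state (`counts`) is updated per event, so memory use depends on the number of distinct users rather than the length of the stream.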

These examples can help students understand how to handle, process, and analyze different types of data in various data engineering contexts.

# HTML Data (Web Scraping)

In [3]:
import requests
from bs4 import BeautifulSoup

In [5]:
# URL of IMDb's Top 250 movies page
url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the web page")
else:
    print(f"Failed to fetch the web page. Status code: {response.status_code}")


Failed to fetch the web page. Status code: 403
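
A 403 status means the server refused the request. One common cause, assumed here rather than confirmed for this page, is that a site rejects the default `User-Agent` that HTTP libraries send. The sketch below only builds a request object with a browser-like header using the standard library, so nothing is sent over the network; the header string is an illustrative placeholder.

```python
from urllib.request import Request

url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

# Illustrative browser-like header value (an assumption, not something the site documents).
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

# Build the request only; it is not sent here.
request = Request(url, headers=headers)
print(request.get_header("User-agent"))
```

The same dictionary can be passed to `requests.get(url, headers=headers)`. Whether that is enough to get a 200 response depends on the site, and its terms of use should be checked before scraping.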


In [6]:
# Parse the content of the page
soup = BeautifulSoup(response.content, 'html.parser')


In [None]:
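# Find the table that contains the movie data
movies_table = soup.find('tbody', class_='lister-list')

# Stop early if the table is missing, e.g. after a failed request or a page layout change
if movies_table is None:
    raise RuntimeError("Movie table not found; check the response status and the page markup")

# Extract the list of movies
movies = movies_table.find_all('tr')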

# Initialize an empty list to store movie details
top_movies = []

# Loop through the first 10 movies
for movie in movies[:10]:
    # Get the movie title
    title_column = movie.find('td', class_='titleColumn')
    title = title_column.a.text.strip()

    # Get the release year
    year = title_column.span.text.strip('()')

    # Get the IMDb rating
    rating_column = movie.find('td', class_='ratingColumn imdbRating')
    rating = rating_column.strong.text.strip()

    # Store the movie details in a dictionary
    movie_details = {
        'title': title,
        'year': year,
        'rating': rating
    }

    # Add the movie details to the list
    top_movies.append(movie_details)

# Print out the scraped movie data
for idx, movie in enumerate(top_movies, start=1):
    print(f"{idx}. {movie['title']} ({movie['year']}) - Rating: {movie['rating']}")


# JSON Data

In [8]:
import json

import pandas as pd

# Load JSON data from a file (data.json must exist in the working directory)
with open('data.json', 'r') as file:
    data = json.load(file)


FileNotFoundError: [Errno 2] No such file or directory: 'data.json'

In [9]:
json_data = '''{
    "employees": [
        {"name": "John Doe", "age": 30, "department": "Engineering"},
        {"name": "Jane Smith", "age": 25, "department": "Marketing"},
        {"name": "Mike Johnson", "age": "N/A", "department": "Sales"}
    ]
}'''

# Load JSON data from a string
data = json.loads(json_data)


In [None]:
# Assuming the JSON data is a dictionary with a list of dictionaries as one of its values
df = pd.DataFrame(data['employees'])


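In [None]:
# Treat "N/A" and empty strings as missing values
df.replace({"N/A": None, "": None}, inplace=True)


In [None]:
# Convert age to a numeric type; values that cannot be parsed become NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')


In [None]:
# Drop rows that still contain missing values
df.dropna(inplace=True)


In [None]:
# Rename columns: name -> full_name, department -> dept
df.rename(columns={'name': 'full_name', 'department': 'dept'}, inplace=True)


In [None]:
# Keep only employees older than 25
df = df[df['age'] > 25]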


In [None]:
# Save the cleaned DataFrame to a new JSON file
df.to_json('cleaned_data.json', orient='records', indent=4)

# Or save as a CSV file
df.to_csv('cleaned_data.csv', index=False)
