# Types of Data

### 1. **Structured Data**
   - **Relational Database Data**: 
     - **Example**: A SQL database containing tables like `Customers`, `Orders`, and `Products`.
     - **Columns**: `CustomerID`, `OrderID`, `ProductID`, `OrderDate`, `Quantity`, `Price`.
   - **CSV Files**:
     - **Example**: A CSV file with employee records.
     - **Columns**: `EmployeeID`, `Name`, `Department`, `Salary`, `HireDate`.
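
As a minimal sketch of how tabular data like this can be read, the snippet below parses a small in-memory CSV with Python's standard `csv` module. Only the column names come from the example above; the two sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical sample rows; only the column names follow the example above.
raw_csv = """EmployeeID,Name,Department,Salary,HireDate
1,Alice,Engineering,85000,2020-03-15
2,Bob,Marketing,62000,2019-07-01
"""

# DictReader maps each data row to a dict keyed by the header row.
employees = list(csv.DictReader(io.StringIO(raw_csv)))

# Every value arrives as a string, so numeric columns need an explicit cast.
total_salary = sum(int(row["Salary"]) for row in employees)

print(employees[0]["Name"], total_salary)
```

Because every row has the same fixed set of columns, the data can be loaded without inspecting individual records first, which is the property that makes it "structured".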

### 2. **Semi-Structured Data**
   - **JSON Data**:
     - **Example**: A JSON file containing user profiles.
     - **Structure**:
       ```json
       {
         "user_id": "123",
         "name": "John Doe",
         "preferences": {
           "color": "blue",
           "food": "pizza"
         },
         "purchase_history": [
           {"product_id": "001", "date": "2023-01-01", "amount": 25.5},
           {"product_id": "002", "date": "2023-02-14", "amount": 15.0}
         ]
       }
       ```
   - **XML Data**:
     - **Example**: An XML file containing book information.
     - **Structure**:
       ```xml
       <book>
         <title>Data Engineering 101</title>
         <author>Jane Smith</author>
         <published>2021-05-10</published>
         <price>39.99</price>
       </book>
       ```
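
The two formats above can be parsed with Python's standard library alone. The sketch below uses shortened copies of the JSON and XML samples; the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

# Shortened copy of the JSON user profile shown above.
user_json = """
{
  "user_id": "123",
  "preferences": {"color": "blue", "food": "pizza"},
  "purchase_history": [
    {"product_id": "001", "date": "2023-01-01", "amount": 25.5},
    {"product_id": "002", "date": "2023-02-14", "amount": 15.0}
  ]
}
"""
profile = json.loads(user_json)

# Nested objects become dicts and arrays become lists.
favorite_color = profile["preferences"]["color"]
total_spent = sum(p["amount"] for p in profile["purchase_history"])

# Shortened copy of the XML book record shown above.
book_xml = "<book><title>Data Engineering 101</title><price>39.99</price></book>"
book = ET.fromstring(book_xml)

# XML element text is always a string, so numbers need an explicit cast.
title = book.findtext("title")
price = float(book.findtext("price"))

print(favorite_color, total_spent, title, price)
```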

### 3. **Unstructured Data**
   - **Text Files**:
     - **Example**: A collection of text files with raw customer feedback.
     - **Content**: 
       ```
       "The product is great but the delivery was slow."
       "I love the new features in the latest update!"
       ```
   - **Log Files**:
     - **Example**: Server logs capturing user activities.
     - **Content**:
       ```
       2023-08-13 10:22:34, UserID: 123, Action: Login
       2023-08-13 10:23:01, UserID: 123, Action: Viewed Product, ProductID: 456
       ```
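
Raw log lines usually have to be parsed before they can be analyzed. The sketch below assumes that every line follows the layout of the two sample lines above; real server logs often vary more, so the pattern is an illustration rather than a general-purpose parser.

```python
import re

log_lines = [
    "2023-08-13 10:22:34, UserID: 123, Action: Login",
    "2023-08-13 10:23:01, UserID: 123, Action: Viewed Product, ProductID: 456",
]

# Pattern written for the sample layout only; ProductID is optional.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), "
    r"UserID: (?P<user_id>\d+), "
    r"Action: (?P<action>[^,]+)"
    r"(?:, ProductID: (?P<product_id>\d+))?$"
)


def parse_log_line(line):
    """Return the named fields of a log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


records = [parse_log_line(line) for line in log_lines]
for record in records:
    print(record)
```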

### 4. **Time-Series Data**
   - **Sensor Data**:
     - **Example**: Data from IoT sensors monitoring temperature and humidity.
     - **Columns**: `Timestamp`, `SensorID`, `Temperature`, `Humidity`.
     - **Sample**:
       ```
       2023-08-13 10:00:00, Sensor_01, 25.4, 60
       2023-08-13 10:05:00, Sensor_01, 25.7, 58
       ```
   - **Financial Data**:
     - **Example**: Stock price data.
     - **Columns**: `Timestamp`, `StockSymbol`, `OpenPrice`, `ClosePrice`, `Volume`.
     - **Sample**:
       ```
       2023-08-13 09:30:00, AAPL, 145.6, 147.3, 50000
       2023-08-13 09:45:00, AAPL, 147.3, 148.0, 62000
       ```
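
What sets time-series data apart is that the timestamp orders the records and defines the spacing between them. As a small sketch using the two sensor readings above, the snippet parses the timestamps, sorts by time, and computes the sampling interval and the average temperature.

```python
from datetime import datetime
from statistics import mean

# The two sensor readings from the sample above.
readings = [
    ("2023-08-13 10:00:00", "Sensor_01", 25.4, 60),
    ("2023-08-13 10:05:00", "Sensor_01", 25.7, 58),
]

# Convert timestamp strings to datetime objects so they can be compared and subtracted.
parsed = [
    (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), sensor_id, temperature, humidity)
    for ts, sensor_id, temperature, humidity in readings
]
parsed.sort(key=lambda row: row[0])

# Spacing between consecutive readings, in seconds.
interval_seconds = (parsed[1][0] - parsed[0][0]).total_seconds()
avg_temperature = mean(row[2] for row in parsed)

print(interval_seconds, avg_temperature)
```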

### 5. **Graph Data**
   - **Social Network Data**:
     - **Example**: A graph dataset representing a social network.
     - **Nodes**: Users.
     - **Edges**: Friend relationships.
     - **Structure**:
       ```json
       {
         "nodes": [
           {"user_id": "123", "name": "Alice"},
           {"user_id": "124", "name": "Bob"}
         ],
         "edges": [
           {"from": "123", "to": "124", "type": "friend"}
         ]
       }
       ```
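
A common first step with graph data is to turn the node and edge lists into an adjacency structure that answers "who is connected to whom". The sketch below does this for the sample graph above; treating friendship as mutual (undirected) is an assumption, since the sample only stores a `from` and a `to`.

```python
from collections import defaultdict

graph = {
    "nodes": [
        {"user_id": "123", "name": "Alice"},
        {"user_id": "124", "name": "Bob"},
    ],
    "edges": [
        {"from": "123", "to": "124", "type": "friend"},
    ],
}

# Lookup table from user_id to display name.
names = {node["user_id"]: node["name"] for node in graph["nodes"]}

# Adjacency sets: each user_id maps to the set of connected user_ids.
adjacency = defaultdict(set)
for edge in graph["edges"]:
    # Assumption: friendship is mutual, so the edge is added in both directions.
    adjacency[edge["from"]].add(edge["to"])
    adjacency[edge["to"]].add(edge["from"])

friends_of_alice = sorted(names[user_id] for user_id in adjacency["123"])
print(friends_of_alice)
```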

### 6. **Geospatial Data**
   - **Location Data**:
     - **Example**: GPS coordinates of delivery trucks.
     - **Columns**: `TruckID`, `Latitude`, `Longitude`, `Timestamp`.
     - **Sample**:
       ```
       TRK_001, 40.7128, -74.0060, 2023-08-13 08:00:00
       TRK_002, 34.0522, -118.2437, 2023-08-13 08:05:00
       ```
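
Latitude and longitude cannot be subtracted like ordinary columns, because the coordinates lie on a sphere. One common approach is the haversine formula, sketched below for the two truck positions in the sample. The Earth radius constant is the usual mean value, and the result is a great-circle distance, not a road distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Positions of TRK_001 and TRK_002 from the sample above.
distance_km = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
print(round(distance_km, 1))
```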

### 7. **Image Data**
   - **Image Files**:
     - **Example**: A dataset of labeled images for computer vision tasks.
     - **Structure**: 
       - `Image`: A file representing the image.
       - `Label`: A category label for the image (e.g., "cat", "dog").
     - **File**: `image_001.jpg` with label `cat`.

### 8. **Audio Data**
   - **Audio Files**:
     - **Example**: A collection of audio recordings for speech recognition.
     - **Files**: `audio_001.wav`, `audio_002.wav`.
     - **Metadata**: 
       - `Duration`: Length of the audio.
       - `Transcript`: Text transcription of the speech.
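
Metadata such as `Duration` can often be read from the file header without decoding the audio. As a sketch, the snippet below uses Python's standard `wave` module. Since `audio_001.wav` is not available here, it first writes two seconds of silence into an in-memory buffer; the sample rate and sample width are arbitrary choices for the illustration.

```python
import io
import wave

SAMPLE_RATE = 8000   # frames per second (arbitrary choice)
NUM_FRAMES = 16000   # 2 seconds at 8000 frames per second

# Write a mono, 16-bit WAV file consisting of silence into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    writer.setframerate(SAMPLE_RATE)
    writer.writeframes(b"\x00\x00" * NUM_FRAMES)

# Read the header back and derive the duration from frame count and frame rate.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    duration_seconds = reader.getnframes() / reader.getframerate()

print(duration_seconds)
```

For a file on disk, passing the path (for example `wave.open("audio_001.wav", "rb")`) in place of the buffer would follow the same pattern.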

### 9. **Video Data**
   - **Video Files**:
     - **Example**: Security camera footage.
     - **Files**: `video_001.mp4`.
     - **Metadata**: 
       - `Duration`: Length of the video.
       - `Timestamp`: When the video was recorded.

### 10. **Streaming Data**
   - **Real-Time Data Streams**:
     - **Example**: Real-time tweets from Twitter API.
     - **Structure**: JSON objects with fields like `tweet_id`, `user_id`, `timestamp`, `text`.
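
The defining trait of streaming data is that records are processed one at a time as they arrive, not loaded as a complete file. The sketch below imitates this with a generator over a few hand-written JSON lines; the sample tweets are invented, and a real stream would come from a network connection rather than a list.

```python
import json
from collections import Counter


def tweet_stream(lines):
    """Yield one parsed tweet at a time, as a stand-in for a live feed."""
    for line in lines:
        yield json.loads(line)


# Invented sample messages using the field names listed above.
sample_lines = [
    '{"tweet_id": "1", "user_id": "u1", "timestamp": "2023-08-13T10:00:00Z", "text": "first"}',
    '{"tweet_id": "2", "user_id": "u2", "timestamp": "2023-08-13T10:00:01Z", "text": "second"}',
    '{"tweet_id": "3", "user_id": "u1", "timestamp": "2023-08-13T10:00:02Z", "text": "third"}',
]

# Running state is updated per record; the full stream is never held in memory.
tweets_per_user = Counter()
for tweet in tweet_stream(sample_lines):
    tweets_per_user[tweet["user_id"]] += 1

print(dict(tweets_per_user))
```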

These examples can help students understand how to handle, process, and analyze different types of data in various data engineering contexts.

---

# 1. Structured Data

In [15]:
import pandas as pd

df = pd.read_csv('DataSets/Housing.csv')

df.head(10)

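|   | rownames | price | lotsize | bedrooms | bathrms | stories | driveway | recroom | fullbase | gashw | airco | garagepl | prefarea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42000 | 5850 | 3 | 1 | 2 | yes | no | yes | no | no | 1 | no |
| 1 | 2 | 38500 | 4000 | 2 | 1 | 1 | yes | no | no | no | no | 0 | no |
| 2 | 3 | 49500 | 3060 | 3 | 1 | 1 | yes | no | no | no | no | 0 | no |
| 3 | 4 | 60500 | 6650 | 3 | 1 | 2 | yes | yes | no | no | no | 0 | no |
| 4 | 5 | 61000 | 6360 | 2 | 1 | 1 | yes | no | no | no | no | 0 | no |
| 5 | 6 | 66000 | 4160 | 3 | 1 | 1 | yes | yes | yes | no | yes | 0 | no |
| 6 | 7 | 66000 | 3880 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | no |
| 7 | 8 | 69000 | 4160 | 3 | 1 | 3 | yes | no | no | no | no | 0 | no |
| 8 | 9 | 83800 | 4800 | 3 | 1 | 1 | yes | yes | yes | no | no | 0 | no |
| 9 | 10 | 88500 | 5500 | 3 | 2 | 4 | yes | yes | no | no | yes | 1 | no |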


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   rownames  546 non-null    int64 
 1   price     546 non-null    int64 
 2   lotsize   546 non-null    int64 
 3   bedrooms  546 non-null    int64 
 4   bathrms   546 non-null    int64 
 5   stories   546 non-null    int64 
 6   driveway  546 non-null    object
 7   recroom   546 non-null    object
 8   fullbase  546 non-null    object
 9   gashw     546 non-null    object
 10  airco     546 non-null    object
 11  garagepl  546 non-null    int64 
 12  prefarea  546 non-null    object
dtypes: int64(7), object(6)
memory usage: 55.6+ KB


In [24]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("local[*]") \
    .getOrCreate()

# Verify the Spark session is created
print("Spark Session Created")

# Read the CSV file into a DataFrame
spdf = spark.read.csv('DataSets/Housing.csv', header=True, inferSchema=True)

# Show the DataFrame
spdf.show()

# Stop the Spark session
spark.stop()

24/08/14 09:32:14 INFO SparkContext: Running Spark version 3.5.2
24/08/14 09:32:14 INFO SparkContext: OS info Mac OS X, 14.5, aarch64
24/08/14 09:32:14 INFO SparkContext: Java version 22.0.1
24/08/14 09:32:14 INFO ResourceUtils: No custom resources configured for spark.driver.
24/08/14 09:32:14 INFO SparkContext: Submitted application: MySparkApp
24/08/14 09:32:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/08/14 09:32:14 INFO ResourceProfile: Limiting resource is cpu
24/08/14 09:32:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/08/14 09:32:14 INFO SecurityManager: Changing view acls to: paritoshsharma
24/08/14 09:32:14 INFO SecurityManager: Changing modify acls to: paritoshsharma
24/08/14 09:32:14 INFO SecurityMan

Spark Session Created



+--------+-----+-------+--------+-------+-------+--------+-------+--------+-----+-----+--------+--------+
|rownames|price|lotsize|bedrooms|bathrms|stories|driveway|recroom|fullbase|gashw|airco|garagepl|prefarea|
+--------+-----+-------+--------+-------+-------+--------+-------+--------+-----+-----+--------+--------+
|       1|42000|   5850|       3|      1|      2|     yes|     no|     yes|   no|   no|       1|      no|
|       2|38500|   4000|       2|      1|      1|     yes|     no|      no|   no|   no|       0|      no|
|       3|49500|   3060|       3|      1|      1|     yes|     no|      no|   no|   no|       0|      no|
|       4|60500|   6650|       3|      1|      2|     yes|    yes|      no|   no|   no|       0|      no|
|       5|61000|   6360|       2|      1|      1|     yes|     no|      no|   no|   no|       0|      no|
|       6|66000|   4160|       3|      1|      1|     yes|    yes|     yes|   no|  yes|       0|      no|
|       7|66000|   3880|       3|      2|     


In [107]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("SparkSQLExample") \
    .getOrCreate()

# Create a DataFrame
data = [("John", 28), ("Anna", 24), ("Mike", 32)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Create a temporary view
df.createOrReplaceTempView("people")

# Run SQL query
result = spark.sql("SELECT * FROM people WHERE Age > 25")

# Show the result
result.show()

# Stop the SparkSession
# spark.stop()


24/08/16 09:53:23 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.

+----+---+
|Name|Age|
+----+---+
|John| 28|
|Mike| 32|
+----+---+




In [3]:
import datetime
import mysql.connector

cnx = mysql.connector.connect(user='root', password='root',
                              host='localhost',
                              database='ML_Learning')
cursor = cnx.cursor()

query = ("SELECT first_name, last_name, hire_date FROM employees "
         "WHERE hire_date BETWEEN %s AND %s")

hire_start = datetime.date(1999, 1, 1)
hire_end = datetime.date(1999, 12, 31)

cursor.execute(query, (hire_start, hire_end))

for (first_name, last_name, hire_date) in cursor:
  print("{}, {} was hired on {:%d %b %Y}".format(
    last_name, first_name, hire_date))

cursor.close()
cnx.close()

ProgrammingError: 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)

In [123]:
import pandas as pd
import glob

# Path pattern to match CSV files
path_pattern = '/Users/paritoshsharma/Desktop/Machine_Learning_Training/Ram/Data/*.csv'

# Use glob to get the list of CSV files
all_files = glob.glob(path_pattern)

# Read and concatenate all CSV files
tf = pd.concat((pd.read_csv(f, skiprows = 10) for f in all_files), ignore_index=True)

# Display the combined DataFrame
print(tf.head())


# tf = pd.read_csv('/Users/paritoshsharma/Desktop/Machine_Learning_Training/Ram/Data/input_Input_Data_13.csv', skiprows=10)

   area   employee id  in time(days)  out time  total income
0      3           16             24        68          4284
1     44           16             29        10          3412
2     32           16             33        98          3105
3     70           16             36        59          3454
4     49           16             48         9          1691


In [129]:
# mean() does not take a column name; its first parameter is numeric_only
tf.groupby('employee id').mean(numeric_only=True)

| employee id | area | in time(days) | out time | total income |
|---|---|---|---|---|
| 13 | 26.823206 | 27.49319 | 46.922472 | 2195.509953 |
| 14 | 50.799633 | 24.800157 | 49.676794 | 3257.570718 |
| 15 | 51.127554 | 24.681509 | 50.937402 | 3278.567051 |
| 16 | 50.451545 | 24.683866 | 49.543216 | 3235.18989 |
| 17 | 49.509429 | 24.886328 | 49.617077 | 3266.893138 |


In [124]:
sptf = spark.createDataFrame(tf)

In [125]:
sptf.show()

+-----+-----------+-------------+--------+------------+
|area |employee id|in time(days)|out time|total income|
+-----+-----------+-------------+--------+------------+
|    3|         16|           24|      68|        4284|
|   44|         16|           29|      10|        3412|
|   32|         16|           33|      98|        3105|
|   70|         16|           36|      59|        3454|
|   49|         16|           48|       9|        1691|
|   74|         16|           40|      89|        3912|
|   87|         16|           43|      78|        3183|
|   69|         16|           23|      28|        4651|
|   79|         16|           27|      28|        1666|
|   12|         16|            5|       3|        4965|
|   63|         16|           24|      78|        1562|
|   30|         16|           39|      84|        1622|
|   96|         16|           12|      62|        2818|
|   39|         16|           33|      56|        1549|
|   41|         16|           28|       8|      


In [126]:
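# Multiply each measure by area. Despite their names, the new columns hold
# products (the numerators of area-weighted averages), not averages.
# Note the trailing space in the source column name 'area '.
sptf2 = sptf.withColumn("Average", sptf['area ']*sptf['total income']) \
                  .withColumn("OUTAverage", sptf['area ']*sptf['out time']) \
                  .withColumn("INAverage", sptf['area ']*sptf['in time(days)'])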

In [127]:
sptf2.show()

+-----+-----------+-------------+--------+------------+-------+----------+---------+
|area |employee id|in time(days)|out time|total income|Average|OUTAverage|INAverage|
+-----+-----------+-------------+--------+------------+-------+----------+---------+
|    3|         16|           24|      68|        4284|  12852|       204|       72|
|   44|         16|           29|      10|        3412| 150128|       440|     1276|
|   32|         16|           33|      98|        3105|  99360|      3136|     1056|
|   70|         16|           36|      59|        3454| 241780|      4130|     2520|
|   49|         16|           48|       9|        1691|  82859|       441|     2352|
|   74|         16|           40|      89|        3912| 289488|      6586|     2960|
|   87|         16|           43|      78|        3183| 276921|      6786|     3741|
|   69|         16|           23|      28|        4651| 320919|      1932|     1587|
|   79|         16|           27|      28|        1666| 131614|  


In [119]:
sptf2.groupBy('employee id').sum('Average').show(truncate=False)

+-----------+------------+
|employee id|sum(Average)|
+-----------+------------+
|13         |239901859   |
+-----------+------------+




In [105]:
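# Same products as in the Spark version: area * value, used later as the
# numerators of area-weighted averages.
tf['Average'] = tf['area ']*tf['total income']
tf['OUTAverage'] = tf['area ']*tf['out time']
tf['INAverage'] = tf['area ']*tf['in time(days)']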

In [106]:
tf.head()

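|   | area | employee id | in time(days) | out time | total income | Average | OUTAverage | INAverage |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 13 | 50 | 69 | 1494 | 14940 | 690 | 500 |
| 1 | 200 | 13 | 36 | 46 | 1296 | 259200 | 9200 | 7200 |
| 2 | 200 | 13 | 15 | 46 | 1195 | 239000 | 9200 | 3000 |
| 3 | 200 | 13 | 11 | 46 | 1192 | 238400 | 9200 | 2200 |
| 4 | 30 | 13 | 18 | 47 | 1267 | 38010 | 1410 | 540 |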


In [102]:
# Area-weighted average per employee: sum(area * value) / sum(area).
# Each result gets its own name so later lines do not overwrite earlier ones.
sums = tf.groupby('employee id').sum(numeric_only=True)
ave = sums['Average'] / sums['area ']          # weighted total income
ave_out = sums['OUTAverage'] / sums['area ']   # weighted out time
ave_in = sums['INAverage'] / sums['area ']     # weighted in time

In [101]:
# sum() does not take a column name; its first parameter is numeric_only
tf.groupby('employee id').sum(numeric_only=True)

| employee id | area | in time(days) | out time | total income | Average |
|---|---|---|---|---|---|
| 13 | 102411 | 104969 | 179150 | 8382457 | 239901859 |


In [99]:
tf.groupby('employee id').sum(numeric_only=True).reset_index()

|   | employee id | area | in time(days) | out time | total income | Average |
|---|---|---|---|---|---|---|
| 0 | 13 | 102411 | 104969 | 179150 | 8382457 | 239901859 |


In [104]:
min(ave)

2342.5399517629944

In [55]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Read CSV with Header Starting from 11th Row") \
    .getOrCreate()

# Path to the CSV file
csv_file_path = "/Users/paritoshsharma/Desktop/Machine_Learning_Training/Ram/Data/input_Input_Data_16.csv"

# Read the entire CSV file as a text file
rdd = spark.sparkContext.textFile(csv_file_path)

# Filter out the first 10 rows (assuming they are null or irrelevant)
filtered_rdd = rdd.zipWithIndex().filter(lambda x: x[1] >= 10).keys()

# Convert the RDD back to a DataFrame, splitting by the delimiter (e.g., comma)
df = filtered_rdd.map(lambda line: line.split(",")).toDF()

# Assign the 11th row as the header (if not manually specifying the schema)
header = df.first()

# Create a DataFrame with the header
df_with_header = df.toDF(*header)

# Filter out the header row from the DataFrame
df_with_header = df_with_header.filter(df_with_header[header[0]] != header[0])

# Show the DataFrame
df_with_header.show()




+-----+-----------+-------------+--------+------------+
|area |employee id|in time(days)|out time|total income|
+-----+-----------+-------------+--------+------------+
|    3|         16|           24|      68|        4284|
|   44|         16|           29|      10|        3412|
|   32|         16|           33|      98|        3105|
|   70|         16|           36|      59|        3454|
|   49|         16|           48|       9|        1691|
|   74|         16|           40|      89|        3912|
|   87|         16|           43|      78|        3183|
|   69|         16|           23|      28|        4651|
|   79|         16|           27|      28|        1666|
|   12|         16|            5|       3|        4965|
|   63|         16|           24|      78|        1562|
|   30|         16|           39|      84|        1622|
|   96|         16|           12|      62|        2818|
|   39|         16|           33|      56|        1549|
|   41|         16|           28|       8|      


# 2. Semi-Structured Data

## NoSQL or JSON Data


In [170]:
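import os

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Read the connection string from the environment instead of hard-coding credentials.
# MONGODB_URI is an assumed variable name; it should hold a string of the form
# "mongodb+srv://<user>:<password>@<cluster-host>/?retryWrites=true&w=majority"
uri = os.environ["MONGODB_URI"]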

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [176]:
# Select the database (replace 'sample_mflix' with your database name)
db = client['sample_mflix']

# Select the collections
collection1 = db['comments']
collection2 = db['movies']

# Fetch all documents from each collection
comments = list(collection1.find())
movies = list(collection2.find())

# Fetch a single document from the movies collection
document = collection2.find_one()

print(document)

{'_id': ObjectId('573a1390f29313caabcd42e8'), 'plot': 'A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.', 'genres': ['Short', 'Western'], 'runtime': 11, 'cast': ['A.C. Abadie', "Gilbert M. 'Broncho Billy' Anderson", 'George Barnes', 'Justus D. Barnes'], 'poster': 'https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI@._V1_SY1000_SX677_AL_.jpg', 'title': 'The Great Train Robbery', 'fullplot': "Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.", 'languages': ['English'], 'released': datetime.datetime(1903, 12, 1, 0, 0), 'directors': ['Edwin S. Porter'], 'rated': 'TV-G', 'awards': {'wins': 1, 'nominations': 0, 'text': '1 win.'},

In [159]:
# Example: Fetch movies where the 'runtime' field is greater than 25
query = {"runtime": {"$gt": 25}}
d0 = collection2.find(query)

for d in d0:
    print(d)

In [160]:
# find() returns a lazy cursor over the comments collection;
# documents are only fetched as the cursor is iterated
document = collection1.find()

print(document)

for d in document:
    print(d)
client.close()

<pymongo.cursor.Cursor object at 0x1611dea00>
{'_id': ObjectId('5a9427648b0beebeb69579e7'), 'name': 'Mercedes Tyler', 'email': 'mercedes_tyler@fakegmail.com', 'movie_id': ObjectId('573a1390f29313caabcd4323'), 'text': 'Eius veritatis vero facilis quaerat fuga temporibus. Praesentium expedita sequi repellat id. Corporis minima enim ex. Provident fugit nisi dignissimos nulla nam ipsum aliquam.', 'date': datetime.datetime(2002, 8, 18, 4, 56, 7)}
{'_id': ObjectId('5a9427648b0beebeb69579f5'), 'name': 'John Bishop', 'email': 'john_bishop@fakegmail.com', 'movie_id': ObjectId('573a1390f29313caabcd446f'), 'text': 'Id error ab at molestias dolorum incidunt. Non deserunt praesentium dolorem nihil. Optio tempora vel ut quas.\nMinus dicta numquam quasi. Rem totam cumque at eum. Ullam hic ut ea magni.', 'date': datetime.datetime(1975, 1, 21, 0, 31, 22)}
{'_id': ObjectId('5a9427648b0beebeb6957a21'), 'name': "Jaqen H'ghar", 'email': 'tom_wlaschiha@gameofthron.es', 'movie_id': ObjectId('573a1390f29313ca




In [177]:
# Convert the list of dictionaries into a DataFrame
df_comments = pd.DataFrame(comments)

df_movies = pd.DataFrame(movies)

# # Display the final DataFrame
df_comments.head()

|   | _id | name | email | movie_id | text | date |
|---|---|---|---|---|---|---|
| 0 | 5a9427648b0beebeb69579e7 | Mercedes Tyler | mercedes_tyler@fakegmail.com | 573a1390f29313caabcd4323 | Eius veritatis vero facilis quaerat fuga tempo... | 2002-08-18 04:56:07 |
| 1 | 5a9427648b0beebeb69579f5 | John Bishop | john_bishop@fakegmail.com | 573a1390f29313caabcd446f | Id error ab at molestias dolorum incidunt. Non... | 1975-01-21 00:31:22 |
| 2 | 5a9427648b0beebeb6957a21 | Jaqen H'ghar | tom_wlaschiha@gameofthron.es | 573a1390f29313caabcd516c | Minima odit officiis minima nam. Aspernatur id... | 1981-11-08 04:32:25 |
| 3 | 5a9427648b0beebeb6957a22 | Taylor Scott | taylor_scott@fakegmail.com | 573a1390f29313caabcd4eaf | Iure laboriosam quo et necessitatibus sed. Id ... | 1970-11-15 05:54:02 |
| 4 | 5a9427648b0beebeb6957a38 | Yara Greyjoy | gemma_whelan@gameofthron.es | 573a1390f29313caabcd587d | Nobis incidunt ea tempore cupiditate sint. Ita... | 2012-11-26 11:00:57 |


# Working with APIs

API stands for Application Programming Interface. It is a set of rules and protocols that allows one piece of software to communicate with another: it defines which operations are available and which data formats the two sides use to exchange requests and results.

### Key Concepts of APIs
- Endpoints: Specific URLs within an API where specific functions are exposed. For example, https://api.example.com/users might be an endpoint for accessing user data.
- Requests and Responses: An API works through requests and responses. A client sends a request to the server, and the server processes this request and sends back a response. The request often includes parameters or data needed by the server to process the request.
- Headers: Metadata sent with requests and responses, often used for things like authentication (e.g., API keys, tokens) or specifying the data format (e.g., JSON).
- Data Formats: APIs typically use data formats like JSON (JavaScript Object Notation) or XML (eXtensible Markup Language) to structure the data being exchanged.

### Methods (or Verbs):
- GET: Retrieve data from the server.
- POST: Send data to the server to create a new resource.
- PUT: Update an existing resource on the server.
- DELETE: Remove a resource from the server.

### Where and How Do We Use APIs?
APIs are used everywhere in modern software development:
- Web Services: APIs allow different web services (like weather services, social media platforms, or payment gateways) to interact with your application.
- Microservices: In a microservices architecture, different services within an application communicate with each other using APIs.
- Third-Party Integrations: APIs allow your application to integrate with third-party services like Google Maps, Twitter, or payment processors like Stripe or PayPal.
- Mobile Applications: Mobile apps use APIs to fetch data from the cloud or interact with web services.

In [8]:
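import os

import requests

# Your TMDb API key, read from an environment variable so the secret is not stored in the notebook.
# The variable name TMDB_API_KEY is just a naming choice; set it before running this cell.
api_key = os.environ["TMDB_API_KEY"]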

# Base URL for TMDb API
base_url = "https://api.themoviedb.org/3"

# Endpoint for popular movies
endpoint = f"{base_url}/movie/popular"

# Parameters for the request
params = {
    "api_key": api_key,
    "language": "en-US",
    "page": 2  # You can change this to get different pages of results
}

# Send the request to the TMDb API (the timeout keeps the cell from hanging if the server does not respond)
response = requests.get(endpoint, params=params, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    movies = data['results']
    
    # Print the title of each movie
    for movie in movies:
        print(f"Title: {movie['title']}, Release Date: {movie['release_date']}")
else:
    print(f"Failed to fetch data: {response.status_code}")


Title: Jackpot!, Release Date: 2024-08-13
Title: Fly Me to the Moon, Release Date: 2024-07-10
Title: House of Ga'a, Release Date: 2024-07-26
Title: Breaking and Re-entering, Release Date: 2024-02-08
Title: Paradox Effect, Release Date: 2024-06-27
Title: The Mouse Trap, Release Date: 2024-08-23
Title: Descendants: The Rise of Red, Release Date: 2024-07-11
Title: Thelma, Release Date: 2024-06-21
Title: The Ministry of Ungentlemanly Warfare, Release Date: 2024-04-18
Title: Justice League: Crisis on Infinite Earths Part Three, Release Date: 2024-07-16
Title: Alien, Release Date: 1979-05-25
Title: Godzilla x Kong: The New Empire, Release Date: 2024-03-27
Title: Non Negotiable, Release Date: 2024-07-25
Title: Kung Fu Panda 4, Release Date: 2024-03-02
Title: Avengers: Infinity War, Release Date: 2018-04-25
Title: My Fault, Release Date: 2023-06-08
Title: The Union, Release Date: 2024-08-15
Title: Shimmy: The First Monkey King, Release Date: 2023-04-21
Title: Escape from the 21st Century, Rele
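
The cell above relies on the response body being JSON with a `results` list. As a minimal offline sketch of that parsing step, the snippet below decodes a small hand-written body and pulls out the titles. The sample payload and the `extract_titles` helper are illustrative assumptions; only the `results`, `title`, and `release_date` keys come from the code above.

```python
import json

# Hypothetical response body, trimmed to two movies and two fields each.
sample_body = """
{
  "page": 2,
  "results": [
    {"title": "Jackpot!", "release_date": "2024-08-13"},
    {"title": "Alien", "release_date": "1979-05-25"}
  ]
}
"""


def extract_titles(body):
    """Parse a JSON response body and return the movie titles it contains."""
    data = json.loads(body)
    # .get() avoids a KeyError when the payload has no "results" key,
    # for example when the server returns an error object instead.
    return [movie.get("title", "<untitled>") for movie in data.get("results", [])]


print(extract_titles(sample_body))
```

Using `.get()` with a default is one defensive choice; indexing with `data['results']` as in the cell above is fine when the status code has already been checked.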

In [13]:
print(movies)

[{'adult': False, 'backdrop_path': '/pzFbYJfqGKlGxOsDIIsUi6YxVQ.jpg', 'genre_ids': [28, 35, 878], 'id': 1094138, 'original_language': 'en', 'original_title': 'Jackpot!', 'overview': "In the near future, a 'Grand Lottery' has been established - the catch: kill the winner before sundown to legally claim their multi-billion dollar jackpot. When Katie Kim mistakenly finds herself with the winning ticket, she reluctantly joins forces with amateur lottery protection agent Noel Cassidy who must get her to sundown in exchange for a piece of her prize.", 'popularity': 1025.787, 'poster_path': '/wWWlclyWf4PLq9hOf8X5joVEJ6r.jpg', 'release_date': '2024-08-13', 'title': 'Jackpot!', 'video': False, 'vote_average': 6.546, 'vote_count': 207}, {'adult': False, 'backdrop_path': '/zB0g0VaRKHfRrvBT4ouHK5W967W.jpg', 'genre_ids': [10749, 35], 'id': 956842, 'original_language': 'en', 'original_title': 'Fly Me to the Moon', 'overview': "Sparks fly in all directions as marketing maven Kelly Jones, brought in t

In [16]:
df = pd.DataFrame(movies)

In [17]:
df.head(10)

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/pzFbYJfqGKlGxOsDIIsUi6YxVQ.jpg,"[28, 35, 878]",1094138,en,Jackpot!,"In the near future, a 'Grand Lottery' has been...",1025.787,/wWWlclyWf4PLq9hOf8X5joVEJ6r.jpg,2024-08-13,Jackpot!,False,6.546,207
1,False,/zB0g0VaRKHfRrvBT4ouHK5W967W.jpg,"[10749, 35]",956842,en,Fly Me to the Moon,Sparks fly in all directions as marketing mave...,609.666,/gjk8YdXpItoC1in53FCrZMFIuBx.jpg,2024-07-10,Fly Me to the Moon,False,7.033,286
2,False,/c3rwwFFVbkyEI6wPtpPd9lvovPW.jpg,"[28, 36]",1311550,en,House of Ga'a,"At the height of the Oyo Empire, the ferocious...",647.296,/6yK9hmS641NMwRkR1wWAALWI34t.jpg,2024-07-26,House of Ga'a,False,5.908,38
3,False,/5P0FeTl1mO65Xa5xjTIG6Iqdls0.jpg,"[28, 35]",1166073,zh,還錢,A group of thieves attempts to pull off a “rev...,507.672,/qYdS4KIdCmJr3WB4CJYnoGaTVbv.jpg,2024-02-08,Breaking and Re-entering,False,5.976,21
4,False,/xYyPLClpJiA5pq687pGjRel5qgf.jpg,"[28, 53]",1064375,en,Paradox Effect,An innocent woman is forced to confront a dang...,595.166,/koJFEW997sLjpu4e7wmFioA2mhL.jpg,2024-06-27,Paradox Effect,False,6.444,18
5,False,/yhEu41kat5sZ5QhCdZnh3Vypu5V.jpg,"[27, 53, 35]",1225377,en,The Mouse Trap,"It's Alex's 21st Birthday, but she's stuck at ...",591.674,/3ovFaFeojLFIl5ClqhtgYMDS8sE.jpg,2024-08-23,The Mouse Trap,False,2.6,4
6,False,/dn3gbDpXPSwC6saMJOHkCiFA9jn.jpg,"[14, 12, 10751, 35]",974262,en,Descendants: The Rise of Red,After the Queen of Hearts incites a coup on Au...,585.491,/t9u9FWpKlZcp0Wz1qPeV5AIzDsk.jpg,2024-07-11,Descendants: The Rise of Red,False,7.0,257
7,False,/wkPPRIducGfsbaUPsWfw0MCQdX7.jpg,"[28, 35, 12]",1051891,en,Thelma,When 93-year-old Thelma Post gets duped by a p...,571.267,/rUcuageYgv9SsJoWuc0seRWG6JC.jpg,2024-06-21,Thelma,False,7.049,72
8,False,/s5znBQmprDJJ553IMQfwEVlfroH.jpg,"[28, 35, 10752]",799583,en,The Ministry of Ungentlemanly Warfare,"During World War II, the British Army assigns ...",468.313,/8aF0iAKH9MJMYAZdi0Slg77RYa2.jpg,2024-04-18,The Ministry of Ungentlemanly Warfare,False,7.148,844
9,False,/dsGwCEO8tda4FlgHKvL95f0oQbH.jpg,"[16, 878, 28]",1209290,en,Justice League: Crisis on Infinite Earths Part...,Now fully revealed as the ultimate threat to e...,474.444,/a3q8NkM8uTh9E23VsbUOdDSbBeN.jpg,2024-07-16,Justice League: Crisis on Infinite Earths Part...,False,7.411,163


In [19]:
df.columns

Index(['adult', 'backdrop_path', 'genre_ids', 'id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'release_date', 'title', 'video', 'vote_average', 'vote_count'],
      dtype='object')

In [22]:
df = df.drop(['backdrop_path', 'poster_path'], axis = 1)

In [26]:
df['popularity'] = df['popularity'].round(1)

In [29]:
df['vote_average'] = df['vote_average'].round(1)

In [30]:
df.head(10)

Unnamed: 0,adult,genre_ids,id,original_language,original_title,overview,popularity,release_date,title,video,vote_average,vote_count
0,False,"[28, 35, 878]",1094138,en,Jackpot!,"In the near future, a 'Grand Lottery' has been...",1025.8,2024-08-13,Jackpot!,False,6.5,207
1,False,"[10749, 35]",956842,en,Fly Me to the Moon,Sparks fly in all directions as marketing mave...,609.7,2024-07-10,Fly Me to the Moon,False,7.0,286
2,False,"[28, 36]",1311550,en,House of Ga'a,"At the height of the Oyo Empire, the ferocious...",647.3,2024-07-26,House of Ga'a,False,5.9,38
3,False,"[28, 35]",1166073,zh,還錢,A group of thieves attempts to pull off a “rev...,507.7,2024-02-08,Breaking and Re-entering,False,6.0,21
4,False,"[28, 53]",1064375,en,Paradox Effect,An innocent woman is forced to confront a dang...,595.2,2024-06-27,Paradox Effect,False,6.4,18
5,False,"[27, 53, 35]",1225377,en,The Mouse Trap,"It's Alex's 21st Birthday, but she's stuck at ...",591.7,2024-08-23,The Mouse Trap,False,2.6,4
6,False,"[14, 12, 10751, 35]",974262,en,Descendants: The Rise of Red,After the Queen of Hearts incites a coup on Au...,585.5,2024-07-11,Descendants: The Rise of Red,False,7.0,257
7,False,"[28, 35, 12]",1051891,en,Thelma,When 93-year-old Thelma Post gets duped by a p...,571.3,2024-06-21,Thelma,False,7.0,72
8,False,"[28, 35, 10752]",799583,en,The Ministry of Ungentlemanly Warfare,"During World War II, the British Army assigns ...",468.3,2024-04-18,The Ministry of Ungentlemanly Warfare,False,7.1,844
9,False,"[16, 878, 28]",1209290,en,Justice League: Crisis on Infinite Earths Part...,Now fully revealed as the ultimate threat to e...,474.4,2024-07-16,Justice League: Crisis on Infinite Earths Part...,False,7.4,163


#### Assignment 3

1. Fetch the first 5 pages of popular movies from TMDb and combine them into a single DataFrame.
2. Convert the `genre_ids` column into a readable format (genre names instead of numeric IDs).

In [None]:
# NLP (Natural Language Processing) - Technique
# NLTK (Natural Language Toolkit) - Framework
# Python - Language