
In [None]:
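# Step 1: Install the required packages: PySpark (Spark engine), delta-spark (Delta Lake support), Faker (synthetic data), schedule (periodic runs)
# Note: delta-spark is not pinned, so pip backtracks through releases until it finds one compatible with pyspark 3.4.1 (Delta Lake 2.4.x targets Spark 3.4)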
!pip install pyspark==3.4.1 delta-spark faker schedule

Collecting pyspark==3.4.1
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     310.8/310.8 MB 4.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting delta-spark
  Downloading delta_spark-4.0.0-py3-none-any.whl.metadata (1.9 kB)
Collecting faker
  Downloading faker-37.4.2-py3-none-any.whl.metadata (15 kB)
Collecting schedule
  Downloading schedule-1.2.2-py3-none-any.whl.metadata (3.8 kB)
INFO: pip is looking at multiple versions of delta-spark to determine which version is compatible with other requirements. This could take a while.
Collecting delta-spark
  Downloading delta_spark-3.3.2-py3-none-any.whl.metadata (2.2 kB)
  Downloading delta_spark-3.3.1-py3-none-any.whl.metadata (1.9 kB)
  Downloading delta_spark-3.3.0-py3-none-any.whl.metadata (2.0 kB)
  Downloading delta_spark-3.2.1-py3-none-any.whl.metadata (1.9 kB)
  Downloading delta_spark-3.2.0-py3-none-any.whl.metadata

In [None]:
# Step 2: Create and configure the Spark session with Delta Lake support and set the timezone to Asia/Kolkata
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
builder = SparkSession.builder \
    .appName("DeltaTableColab") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "/content/delta-warehouse")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
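The session time zone set above controls how Spark renders timestamps in `show()` output, so the commit times in the version history further down are presumably India Standard Time rather than UTC. The sketch below uses only the standard library to convert one such timestamp to UTC. It assumes the first history timestamp (2025-07-18 16:50:37.223) is in IST and relies on IST being a fixed UTC+05:30 offset; the `IST` constant is introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is a fixed offset of +05:30 with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

# Assumption: this commit timestamp from the history output is rendered in IST.
commit_ist = datetime(2025, 7, 18, 16, 50, 37, 223000, tzinfo=IST)

# Convert to UTC to compare against logs from systems that do not use IST.
commit_utc = commit_ist.astimezone(timezone.utc)
print(commit_utc.isoformat())
```

If the session time zone were left at the Colab default, the same commit would most likely be displayed with the UTC wall-clock time instead.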

In [None]:
# Step 3: Generate fake data using the Faker library
from faker import Faker
import pandas as pd
fake = Faker()
def generate_fake_data(n=5):
    data = [{
        "Name": fake.name(),
        "Address": fake.address(),
        "Email": fake.email()
    } for _ in range(n)]
    return pd.DataFrame(data)
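Faker's default-locale `address()` returns multi-line strings, which is why the `Address` values in the output contain `\n`. If single-line addresses are preferred downstream, one option is to flatten them before building the DataFrame. This is a minimal sketch; `flatten_address` is a hypothetical helper that is not part of the original notebook, and the sample value is copied from the output shown later.

```python
def flatten_address(address: str, sep: str = ", ") -> str:
    """Join the lines of a multi-line address into a single line."""
    # Strip each line and drop empty ones so stray blank lines do not
    # produce doubled separators.
    parts = [line.strip() for line in address.splitlines()]
    return sep.join(part for part in parts if part)


sample = "6842 Christy Port\nHeatherberg, AK 27157"
flat = flatten_address(sample)
print(flat)
```

In the generator above, this could be applied as `flatten_address(fake.address())`, at the cost of losing the original line structure.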

In [None]:
# Step 4: Append the generated fake data to a Delta Lake table
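def append_to_delta(df: pd.DataFrame, path: str):
    # Convert the pandas DataFrame to a Spark DataFrame and append it as a new Delta commit
    spark_df = spark.createDataFrame(df)
    spark_df.write.format("delta").mode("append").save(path)
    print("Appended data:")
    # This reads back the table at `path`, so rows from earlier batches are printed too
    # (show() prints at most 20 rows by default)
    spark.read.format("delta").load(path).show()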

In [None]:
# Step 5: Generate an HTML summary of the DataFrame
def generate_html_summary(df: pd.DataFrame):
    return df.to_html(index=False)

In [None]:
# Step 6: Track and display the version history of the Delta table
from delta.tables import DeltaTable
def track_versions(path: str):
    delta_table = DeltaTable.forPath(spark, path)
    print(" Delta Table Version History:")
    delta_table.history().show(truncate=False)

In [None]:
# Step 7: Define the complete data ingestion pipeline (the scheduler in Step 8 runs it)
from IPython.display import display, HTML
def run_pipeline():
    delta_path = "/content/delta-table"
    # Generate one batch, append it to the Delta table, render it as HTML, then print the table history
    df = generate_fake_data()
    append_to_delta(df, delta_path)
    html_summary = generate_html_summary(df)
    display(HTML(html_summary))
    track_versions(delta_path)

In [None]:
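# Step 8: Schedule the ingestion pipeline to run every 5 minutes
# The first run happens 5 minutes after the job is registered, not immediately
import schedule
import time
schedule.every(5).minutes.do(run_pipeline)
print(" Starting the ingestion pipeline (runs every 5 mins)...")
# Poll once per second; this loop never exits on its own, so interrupt the cell to stop it
while True:
    schedule.run_pending()
    time.sleep(1)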

 Starting the ingestion pipeline (runs every 5 mins)...
Appended data:
+---------------+--------------------+--------------------+
|           Name|             Address|               Email|
+---------------+--------------------+--------------------+
|   Steven Allen|6842 Christy Port...|   cyang@example.net|
| Katie Matthews|8861 Pham Skyway\...| karla52@example.com|
|James Patterson|2624 Carter Gatew...|  xbaker@example.net|
|      Gail Hays|Unit 8568 Box 484...|stephanie33@examp...|
| Yesenia Osborn|9097 Johnson Rout...|tsingleton@exampl...|
+---------------+--------------------+--------------------+



Name,Address,Email
Gail Hays,Unit 8568 Box 4846\nDPO AP 35916,stephanie33@example.net
Yesenia Osborn,"9097 Johnson Route Suite 366\nWest Isaiah, NH 10337",tsingleton@example.com
Steven Allen,"6842 Christy Port\nHeatherberg, AK 27157",cyang@example.net
Katie Matthews,"8861 Pham Skyway\nLake Patrick, GU 81336",karla52@example.com
James Patterson,"2624 Carter Gateway Suite 618\nNew Jerry, MP 54352",xbaker@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|0      |2025-07-18 16:50:37.223|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |null       |Serializable  |true         |

Name,Address,Email
Christian Lindsey,Unit 0888 Box 7027\nDPO AA 57970,joshua09@example.org
Paula Meyer,"7114 Andrew Fall\nDavidton, CT 74802",anthony08@example.org
Kathleen Conway,"07623 Susan Mission Apt. 616\nSouth John, VT 40848",lgonzalez@example.net
Paul Jones,"07908 Smith Street Suite 755\nKevinport, AR 97504",marcusfreeman@example.net
Sarah Johnson,"3577 Phillips Lights\nJohnsonborough, TX 53523",myersjessica@example.org


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|1      |2025-07-18 16:55:51.942|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |0          |Serializable  |true         |

Name,Address,Email
Tamara Sherman,"41242 Justin Motorway Suite 815\nTinamouth, ID 68145",jhodges@example.net
Eric Cochran,"976 Greg Haven Suite 157\nNew Michellestad, MO 24907",matthew55@example.org
Lori Smith,"728 Gomez Shoals\nNorth Jasmine, MS 46372",katie14@example.net
Emma Turner,Unit 0596 Box 9910\nDPO AA 33450,scott68@example.org
Damon Mckenzie,"1183 Walls Run Suite 952\nWest Elizabeth, FL 05787",robinsonmalik@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|2      |2025-07-18 17:00:59.524|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |1          |Serializable  |true         |

Name,Address,Email
Heather Nash,USNS Chapman\nFPO AP 68263,lucassamantha@example.net
Donald Jenkins,"938 Stevenson Rest Apt. 743\nKarenmouth, WY 48369",charles64@example.org
Tyler Lee,"27604 Anna Plains Suite 501\nEmilymouth, CA 89359",matthew39@example.com
Stephanie Soto,"8213 Drew Lakes Apt. 404\nWest Michelle, UT 28167",annette16@example.com
Jonathan Parker,"1607 James Rest Apt. 269\nTinaville, ME 65506",adamscassandra@example.org


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|3      |2025-07-18 17:06:06.289|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |2          |Serializable  |true         |

Name,Address,Email
Pamela Johnson,"3627 Reese Rest Suite 216\nMarshallville, FL 06523",bbishop@example.org
John Sharp,"9460 Lane Pike Apt. 651\nJenniferburgh, AK 00761",pchristensen@example.net
Garrett Parker,"42359 Thomas Glens\nPort Matthewburgh, NE 88755",tonya28@example.org
Toni Santiago,"PSC 7477, Box 3085\nAPO AE 38195",jclark@example.net
Joseph Jackson,"48922 Jennifer Parkways Apt. 143\nNewmanfort, WA 32139",kyle27@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|4      |2025-07-18 17:11:12.465|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |3          |Serializable  |true         |

Name,Address,Email
Andrea Lewis,"PSC 0290, Box 0446\nAPO AE 64053",logan22@example.org
Melissa Lee,"572 Meredith Burgs\nJohnsonport, MP 28108",brenda95@example.net
Jasmine Day,"390 Carey Causeway Suite 669\nNorth Kiaramouth, AL 23268",fordrussell@example.org
Steven Ward,"0915 Jesse Brooks Apt. 376\nWest Deborah, DC 89390",kingcaleb@example.org
Jason Day,"5702 Christopher Stream Apt. 564\nSamanthaview, MT 25322",tburgess@example.org


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|5      |2025-07-18 17:16:19.513|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |4          |Serializable  |true         |

Name,Address,Email
Samantha Santos,"91712 Stuart Coves Suite 346\nNorth Ronald, AZ 56532",erica51@example.org
Nichole Moore,"05116 Melissa Plains\nWest Katherineland, CO 80644",jenniferjohnson@example.com
Kyle Castaneda,"40312 Christensen Loop Apt. 288\nCameronfurt, WA 27021",wcompton@example.net
Jermaine Griffin,"7359 Richard Hollow Apt. 208\nWest Christine, HI 54823",ryanbrittney@example.net
Barbara Carson,USS Garrett\nFPO AA 23006,william96@example.org


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|6      |2025-07-18 17:21:26.42 |null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |5          |Serializable  |true         |

Name,Address,Email
Julie Mendez,"92319 Nichole Ways Apt. 924\nPhilipport, PR 01865",ryanmcmahon@example.net
Daniel Barker,"704 Hall Extensions Apt. 406\nSuzanneview, IN 14008",josephsmith@example.net
Amy Stone,"46111 Barry Common\nRonniefort, WA 92019",rickeyleonard@example.org
April Ortiz,"513 David Dam Apt. 143\nLawrenceberg, AR 97900",perrykevin@example.net
Aaron Willis,"337 Reynolds Centers\nWeavershire, NH 06559",meganpatterson@example.org


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|7      |2025-07-18 17:26:32.478|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |6          |Serializable  |true         |

Name,Address,Email
Theresa Jones,"53971 Collins Fords\nNorth Trevor, NH 63708",lewisamanda@example.com
William Richardson,"76600 Cameron Lodge\nPort Dominique, IN 98141",valerie55@example.com
Kelli Haynes,"862 Luis Loaf Apt. 860\nBarnettbury, ME 14975",christian89@example.net
Kimberly Ford,"17413 Anderson Way\nDanielton, MI 26001",martin20@example.net
Kristin Robertson,"80057 Fisher Landing Suite 833\nWhitneyhaven, NC 89188",carlos65@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|8      |2025-07-18 17:31:38.84 |null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |7          |Serializable  |true         |

Name,Address,Email
Nathaniel Welch,"3346 Debra Landing Apt. 517\nLake Jessicaburgh, AZ 09664",justinjones@example.com
Michael Griffin,Unit 7847 Box 6825\nDPO AA 68933,lauren88@example.com
Susan Ramirez,"861 Rice Fields\nBarkerport, MA 67143",murraykelly@example.com
Rose Stone,"026 Perez Shores\nChristinachester, AZ 02162",hgonzales@example.com
Seth Rivera,"49801 Matthews Ridges\nNorth Tyler, MP 63730",ddaugherty@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|9      |2025-07-18 17:36:44.664|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |8          |Serializable  |true         |

Name,Address,Email
Kimberly Tucker,"595 Maynard Tunnel Apt. 841\nCrawfordburgh, CO 58748",phyllis17@example.net
Xavier Edwards,"86915 Kenneth Estate Apt. 485\nPort Stephen, ME 85624",xchen@example.net
Michael Thompson PhD,"3986 Shawn Cape Apt. 379\nWilsonmouth, RI 18481",robertmoore@example.net
Timothy Williams,"0149 Jeremiah Junction Apt. 453\nConnieshire, FL 66519",jessica60@example.net
Jennifer Fields,"53272 Mendoza Alley\nCharleschester, GA 47461",nlozano@example.net


 Delta Table Version History:
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                           |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+-----------------------------------+----+--------+---------+-----------+--------------+-------------+-----------------------------------------------------------+------------+-----------------------------------+
|10     |2025-07-18 17:41:50.074|null  |null    |WRITE    |{mode -> Append, partitionBy -> []}|null|null    |null     |9          |Serializable  |true         |
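
The commit timestamps above advance by slightly more than five minutes per version (16:50:37, 16:55:51, 17:00:59, ...). This is consistent with `schedule` planning the next run only after the job has returned, so the job's own duration is added to every interval. If runs need to stay aligned to a fixed cadence, one alternative is to compute each target time from the original start. The sketch below is not the notebook's method: `run_fixed_rate` and `FakeClock` are hypothetical names, the clock and sleep functions are injectable so the demo finishes instantly, and unlike `schedule.every(5).minutes` the first run happens immediately.

```python
import time
from typing import Callable, List


def run_fixed_rate(
    job: Callable[[], None],
    interval_s: float,
    max_runs: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run `job` max_runs times, aiming at start + k * interval_s.

    Returns the clock reading taken just before each run.
    """
    start = clock()
    started_at: List[float] = []
    for k in range(max_runs):
        # The target is derived from the original start, so the time spent
        # inside job() does not accumulate as drift.
        delay = (start + k * interval_s) - clock()
        if delay > 0:
            sleep(delay)
        started_at.append(clock())
        job()
    return started_at


class FakeClock:
    """Simulated clock so the demo does not wait for real minutes."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake_clock = FakeClock()


def slow_job() -> None:
    # Pretend each pipeline run takes 7 seconds.
    fake_clock.now += 7.0


starts = run_fixed_rate(slow_job, 300.0, 3, clock=fake_clock.time, sleep=fake_clock.sleep)
print(starts)
```

With the real clock, `run_fixed_rate(run_pipeline, 300, max_runs=12)` would also give the loop a natural end after about an hour, instead of running until the cell is interrupted. If a run takes longer than the interval, this sketch starts the next run immediately rather than skipping it.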