#**Week 6: File ingestion and schema validation**

##Large Russian News Dataset
##**russian_news.csv**  
https://www.kaggle.com/datasets/vyhuholl/large-russian-news-dataset?select=russian_news.csv



In [1]:
#! pip install "git+https://github.com/h2oai/datatable.git"

In [2]:
!pip install pip --upgrade

! pip install numpy
! pip install cython



In [3]:
! pip install pyspark py4j



In [4]:
! pip install datatable



In [5]:
! pip install --upgrade pandas
! pip install --upgrade dask



In [6]:
! pip install ray



In [7]:
import zipfile
import seaborn as sns
import os
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import dask.dataframe as dd
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp
import multiprocessing as mp
import csv
import datatable as dt
import yaml
import gzip
import ray
from ray.data.dataset import Dataset
import time
from subprocess import check_call
import warnings
warnings.filterwarnings('ignore')

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [8]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p '/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/'
%cd '/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6


In [9]:
# Path of the zipped data file, relative to the working directory set above
path = 'russian_news.csv.zip'

# The data is provided as a zip file so we need to extract the files from the zip file
with zipfile.ZipFile(path, 'r') as zip_ref:

    zip_ref.extractall()

##Reading file with **Pandas with chunks**

In [10]:
# Timer start
start = time.time()

csv_files = [f for f in os.listdir() if f.endswith('.csv')]

# Check the size of the CSV file and read it
if csv_files:
    csv_file_path = csv_files[0]

    # Get the file size
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # Size in GB

    # Read the file in chunks, skipping malformed rows
    try:
        chunk_size = 100000  # rows per chunk
        for chunk in pd.read_csv(csv_file_path, chunksize=chunk_size, on_bad_lines='skip'):
            print(chunk.head())  # show the head of each chunk
        print("File read successfully.")
    except pd.errors.ParserError as e:
        print("ParserError encountered:", e)
else:
    print("No CSV file found in the ZIP archive.")

print(time.time() - start, ' seconds')

File size: 1.99 GB
     source                                           title  \
0  lenta.ru                                  Синий богатырь   
1  lenta.ru  Загитова согласилась вести «Ледниковый период»   
2  lenta.ru       Объяснена опасность однообразного питания   
3  lenta.ru                      «Предохраняться? А зачем?»   
4  lenta.ru     Ефремов систематически употреблял наркотики   

                                                text  \
0  В 1930-е годы Советский Союз охватила лихорадк...   
1  Олимпийская чемпионка по фигурному катанию  Ал...   
2  Российский врач-диетолог Римма Мойсенко объясн...   
3  В 2019 году телеканал «Ю» запустил адаптацию з...   
4  Актер  Михаил Ефремов  систематически употребл...   

                        date tags  
0  2020-08-29 21:01:00+00:00  Все  
1  2020-08-31 17:04:00+00:00  Все  
2  2020-08-31 17:07:00+00:00  Все  
3  2020-08-29 21:04:00+00:00  Все  
4  2020-08-31 15:27:00+00:00  Все  
          source                                 
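Printing the head of every chunk mainly shows that the file can be streamed. A more typical use of `chunksize` is to accumulate a small summary per chunk, so the full 1.99 GB never has to sit in memory at once. The following is a minimal sketch of that idea; the in-memory sample rows, the `example.org` source and the `count_rows_by_source` helper are illustrative assumptions, not part of the notebook.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for russian_news.csv: same column layout, made-up rows.
SAMPLE_CSV = (
    "source,title,text,date,tags\n"
    "lenta.ru,t1,body1,2020-08-29 21:01:00+00:00,a\n"
    "lenta.ru,t2,body2,2020-08-31 17:04:00+00:00,a\n"
    "example.org,t3,body3,2020-08-31 17:07:00+00:00,b\n"
)


def count_rows_by_source(csv_source, chunk_size=2):
    """Stream the CSV in chunks and count rows per 'source' value."""
    counts = Counter()
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size, on_bad_lines="skip"):
        # Only the small per-chunk summary is kept; the chunk itself is discarded.
        counts.update(chunk["source"].value_counts().to_dict())
    return counts


counts = count_rows_by_source(io.StringIO(SAMPLE_CSV))
print(dict(counts))
```

For the real file, the same function could be called with the CSV path and a larger `chunk_size` such as the 100000 used above.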

##Reading file with PySpark

In [11]:
# Timer start
start = time.time()

# Create PySpark session
spark = SparkSession.builder \
    .appName("Data Cleaning") \
    .getOrCreate()

# get the list of CSV file
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

if csv_files:
    csv_file_path = csv_files[0]

    # Get the file size
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # Size in GB

    # Read the file with PySpark
    try:
        df = spark.read.csv(
            csv_file_path,
            header=True,  # the first row is a header
            inferSchema=True,  # infer column types from the data
            mode="DROPMALFORMED"  # drop rows that cannot be parsed
        )

        # Show the first 5 rows
        df.show(5)
        print("File read successfully.")

    except Exception as e:
        print("Error reading CSV file with PySpark:", e)

else:
    print("No CSV files found.")

print(f"Execution Time: {time.time() - start} seconds")

File size: 1.99 GB
+--------+--------------------+--------------------+--------------------+----+
|  source|               title|                text|                date|tags|
+--------+--------------------+--------------------+--------------------+----+
|lenta.ru|Загитова согласил...|Олимпийская чемпи...|2020-08-31 17:04:...| Все|
|lenta.ru|Объяснена опаснос...|Российский врач-д...|2020-08-31 17:07:...| Все|
|lenta.ru|«Предохраняться? ...|В 2019 году телек...|2020-08-29 21:04:...| Все|
|lenta.ru|Ефремов системати...|Актер  Михаил Ефр...|2020-08-31 15:27:...| Все|
|lenta.ru|«Вы живете в мире...|27 августа выходи...|2020-08-26 21:02:...| Все|
+--------+--------------------+--------------------+--------------------+----+
only showing top 5 rows

File read successfully.
Execution Time: 58.492955446243286 seconds


##Reading file with **Dask**

In [12]:
# start timer
start = time.time()

# get the list of CSV file
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

if csv_files:
    csv_file_path = csv_files[0]

    # check the size file
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # show the file size with GB

    # read the csv file with Dask
    try:
        # Read every column as a string so dtype inference cannot fail on mixed values
        df = dd.read_csv(
            csv_file_path,
            assume_missing=True,  # treat unspecified integer columns as floats in case values are missing
            dtype=str  # read all columns as strings
        )

        # Placeholder for per-partition cleaning; currently returns each partition unchanged
        def clean_bad_rows(df_chunk):
            try:
                # Add filtering on specific columns here as needed
                return df_chunk
            except Exception as e:
                print("Skipping a problematic chunk:", e)
                return None

        # Apply the cleaning function to every partition
        clean_df = df.map_partitions(clean_bad_rows)

        # Show the first 5 rows (Dask is lazy, so head() triggers the computation)
        print(clean_df.head())
        print("File read successfully.")

    except Exception as e:
        print("Error reading CSV file with Dask:", e)

else:
    print("No CSV files found.")

print(f"Execution Time: {time.time() - start} seconds")

File size: 1.99 GB
     source                                           title  \
0  lenta.ru                                  Синий богатырь   
1  lenta.ru  Загитова согласилась вести «Ледниковый период»   
2  lenta.ru       Объяснена опасность однообразного питания   
3  lenta.ru                      «Предохраняться? А зачем?»   
4  lenta.ru     Ефремов систематически употреблял наркотики   

                                                text  \
0  В 1930-е годы Советский Союз охватила лихорадк...   
1  Олимпийская чемпионка по фигурному катанию  Ал...   
2  Российский врач-диетолог Римма Мойсенко объясн...   
3  В 2019 году телеканал «Ю» запустил адаптацию з...   
4  Актер  Михаил Ефремов  систематически употребл...   

                        date tags  
0  2020-08-29 21:01:00+00:00  Все  
1  2020-08-31 17:04:00+00:00  Все  
2  2020-08-31 17:07:00+00:00  Все  
3  2020-08-29 21:04:00+00:00  Все  
4  2020-08-31 15:27:00+00:00  Все  
File read successfully.
Execution Time: 2.1510493

##Reading file with **Dask with chunks**

In [13]:
# Timer start
start = time.time()

# Get the list of CSV file
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

if csv_files:
    csv_file_path = csv_files[0]

    # Get the file size
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # Size in GB

    # Read the file with Dask
    try:
        # Read the file, skipping malformed rows
        df = dd.read_csv(
            csv_file_path,
            blocksize=25e6,  # partition size of about 25 MB
            assume_missing=True,  # treat integer columns as floats in case values are missing
            quoting=3,  # csv.QUOTE_NONE: quote characters are treated as ordinary data
            encoding="utf-8",  # read as UTF-8
            on_bad_lines="skip",  # skip rows that cannot be parsed
            sep=",",  # explicit delimiter
            lineterminator="\n"  # explicit line terminator
        )

        # check the info of dataframe
        print(f"Number of partitions: {df.npartitions}")

        # process each partition
        for partition in df.to_delayed():
            print(partition.compute().head())  # show the head of each partition

        print("File read successfully.")

    except Exception as e:
        print("Error reading CSV file with Dask:", e)

else:
    print("No CSV files found.")

print(f"Execution Time: {time.time() - start} seconds")

File size: 1.99 GB
Number of partitions: 85

##Reading file with **Ray**

In [14]:
# Initiate Ray
ray.init(ignore_reinit_error=True)

# Start timer
start = time.time()

# Get the list of CSV file
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

if csv_files:
    csv_file_path = csv_files[0]

    # Get the file size
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # Size in GB

    # Read the file with Ray
    try:
        # read CSV
        ds: Dataset = ray.data.read_csv(
            csv_file_path,
            override_num_blocks=85  # number of blocks (adjust as needed)
        )

        # Show dataset summary
        print(f"Number of rows: {ds.count()}")
        print("Schema:")
        print(ds.schema())

        # Show the first batch of 5 rows only
        for batch in ds.iter_batches(batch_size=5):
            print(batch)
            break  # stop after the first batch

        print("File read successfully.")

    except Exception as e:
        print("Error reading CSV file with Ray:", e)

else:
    print("No CSV files found.")

print(f"Execution Time: {time.time() - start} seconds")

# shutdown ray
ray.shutdown()

2024-12-12 23:11:32,594	INFO worker.py:1821 -- Started a local Ray instance.


File size: 1.99 GB


2024-12-12 23:11:35,509	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-12_23-11-28_662550_31876/logs/ray-data
2024-12-12 23:11:35,510	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadCSV] -> AggregateNumRows[AggregateNumRows]


2024-12-12 23:11:37,192	ERROR streaming_executor_state.py:485 -- An exception was raised from a task of operator "ReadCSV->SplitBlocks(6)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-12-12 23:11:37,258	ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-12-12 23:11:37,263	ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/plan.py", line 429, in execute_to_iterator
    bundle_iter = itertools.chain([next(gen)], gen)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution

Error reading CSV file with Ray: ray::ReadCSV->SplitBlocks(6)() (pid=33678, ip=172.28.0.12)
  File "pyarrow/ipc.pxi", line 705, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker. This can mean the data file contains cell values spanning multiple lines; please consider enabling the option 'newlines_in_values'.

The above exception was the direct cause of the following exception:

ray::ReadCSV->SplitBlocks(6)() (pid=33678, ip=172.28.0.12)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/map_operator.py", line 482, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/usr/local/lib/python3.10/dist-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/usr/local/lib/python3.10/dist-packages/ray/data/
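
The `ArrowInvalid` message points at a likely cause: some `text` values contain line breaks inside quotes, so one logical record spans several physical lines, and a reader that splits the file on newlines can cut a record in half. The message itself suggests pyarrow's `newlines_in_values` option for this case. The sketch below only illustrates the underlying problem with the standard-library `csv` module on made-up data; it does not reproduce or fix the Ray run.

```python
import csv
import io

# Made-up sample: the second record's 'text' field contains a line break inside quotes.
SAMPLE = (
    "source,title,text\n"
    'lenta.ru,t1,"single line"\n'
    'lenta.ru,t2,"first line\nsecond line"\n'
)

# What a purely newline-based splitter sees.
physical_lines = SAMPLE.count("\n")

# What a quote-aware parser sees.
records = list(csv.reader(io.StringIO(SAMPLE)))

print("physical lines:", physical_lines)
print("logical records (including header):", len(records))
print("multi-line field:", repr(records[2][2]))
```

The two counts differ, which is exactly the situation in which a block boundary placed at an arbitrary newline can land in the middle of a quoted value.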

##Basic Validation

In [15]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 5 entries, source to tags
dtypes: string(5)

In [16]:
# Uninstall/Install PySpark
!pip uninstall pyspark -y
!pip install -q pyspark

Found existing installation: pyspark 3.5.3
Uninstalling pyspark-3.5.3:
  Successfully uninstalled pyspark-3.5.3


In [17]:
spark = SparkSession.builder \
    .appName("Date Conversion") \
    .master("local[*]") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

# get the list of CSV file
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

if csv_files:
    csv_file_path = csv_files[0]

    # read the CSV file with PySpark
    df = spark.read.csv(
          csv_file_path,
          header=True,  # the first row is a header
          inferSchema=True,  # infer column types from the data
          mode="DROPMALFORMED"  # drop rows that cannot be parsed
      )

    # convert the column 'date' to DateType
    df = df.withColumn("date_as_date", to_date(df["date"]))

    # Drop the original 'date' column and rename 'date_as_date' to 'date'
    df = df.drop("date").withColumnRenamed("date_as_date", "date")

    # Rearrange columns so that 'date' sits immediately to the left of 'tags'
    columns = [c for c in df.columns if c not in ('date', 'tags')]  # everything except 'date' and 'tags'
    columns.append('date')  # then 'date'
    columns.append('tags')  # and 'tags' last
    df = df.select(columns)

    # show the result
    df.show(truncate=False)

    # check the schema
    df.printSchema()

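
To make the intent of the date conversion concrete, here is what it amounts to for a single value, written in plain Python rather than Spark. This is only an illustration of the string format seen in the `date` column; it is an assumption that Spark's `to_date` yields the same calendar date for every row, since Spark applies its own parsing and session time zone rules.

```python
from datetime import date, datetime

# Format of the 'date' column as it appears in the file.
raw_value = "2020-08-29 21:01:00+00:00"

parsed = datetime.fromisoformat(raw_value)  # timezone-aware datetime
as_date = parsed.date()                     # calendar date only; time and offset are dropped

print(parsed.isoformat())
print(as_date)
```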

##Generate testutility.py

In [18]:
# Note: testutility.py is written by the next cell; run that cell first on a fresh runtime
import testutility as util

In [19]:
%%writefile testutility.py
import logging
import os
import subprocess
import yaml
import pandas as pd
import datetime
import gc
import re


################
# File Reading #
################

def read_config_file(filepath):
    with open(filepath, 'r') as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            logging.error(exc)


def replacer(string, char):
    # Collapse runs of two or more `char` into a single one
    pattern = re.escape(char) + '{2,}'
    string = re.sub(pattern, char, string)
    return string

def col_header_val(df, table_config):
    '''
    Standardize the column names of df in place (lowercase, non-word
    characters replaced by single underscores) and compare them,
    ignoring order, with the columns listed in the YAML config.
    Returns 1 if they match and 0 otherwise.
    '''
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(r'[^\w]', '_', regex=True)
    df.columns = [replacer(c.strip('_'), '_') for c in df.columns]
    expected_col = sorted(c.lower() for c in table_config['columns'])
    actual_col = sorted(df.columns)
    if actual_col == expected_col:
        print("column name and column length validation passed")
        return 1
    else:
        print("column name and column length validation failed")
        mismatched_columns_file = list(set(actual_col).difference(expected_col))
        print("Following File columns are not in the YAML file", mismatched_columns_file)
        missing_YAML_file = list(set(expected_col).difference(actual_col))
        print("Following YAML columns are not in the file uploaded", missing_YAML_file)
        logging.info(f'df columns: {actual_col}')
        logging.info(f'expected columns: {expected_col}')
        return 0

Overwriting testutility.py
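
The header check in `testutility.py` boils down to two steps: normalize each column name, then compare the normalized names with the expected list while ignoring order. The sketch below shows those two steps without pandas so they can be tried on plain lists. The helper names `normalize_column_name` and `compare_columns` are hypothetical and are not part of `testutility.py`.

```python
import re


def normalize_column_name(name):
    """Lowercase, replace non-word characters with '_', collapse and trim underscores."""
    name = re.sub(r"[^\w]", "_", name.lower())
    name = re.sub(r"_{2,}", "_", name)
    return name.strip("_")


def compare_columns(actual, expected):
    """Return (extra, missing) column names after normalization, ignoring order."""
    actual_set = {normalize_column_name(c) for c in actual}
    expected_set = {c.lower() for c in expected}
    return sorted(actual_set - expected_set), sorted(expected_set - actual_set)


expected = ["source", "title", "text", "date", "tags"]
print(normalize_column_name(" Source Name!! "))
print(compare_columns(["Source", "Title", "Text", "Date"], expected))
```

An empty pair of lists from `compare_columns` corresponds to the "validation passed" branch of `col_header_val`.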


##Generate YAML file

In [20]:
%%writefile file.yaml
file_type: csv
dataset_name: file
file_name: russian_news
table_name: edsurv
inbound_delimiter: ","
outbound_delimiter: "|"
skip_leading_rows: 1
columns:
    - source
    - title
    - text
    - date
    - tags

Overwriting file.yaml


In [21]:
# Read config file
config_data = util.read_config_file("file.yaml")

In [22]:
config_data['file_type']

'csv'

In [23]:
config_data['inbound_delimiter']

','

In [24]:
#inspecting data of config file
config_data

{'file_type': 'csv',
 'dataset_name': 'file',
 'file_name': 'russian_news',
 'table_name': 'edsurv',
 'inbound_delimiter': ',',
 'outbound_delimiter': '|',
 'skip_leading_rows': 1,
 'columns': ['source', 'title', 'text', 'date', 'tags']}
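
Because every later step reads its settings from this dictionary, it can help to check the config itself before using it. The sketch below is one possible sanity check; the `REQUIRED_KEYS` set and the `check_config` helper are assumptions made for illustration and are not part of `testutility.py`.

```python
# Keys that the later cells of this notebook read from the config (assumed minimum set).
REQUIRED_KEYS = {"file_type", "file_name", "inbound_delimiter", "outbound_delimiter", "columns"}


def check_config(config):
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        problems.append(f"missing keys: {missing}")

    columns = config.get("columns")
    if not isinstance(columns, list) or not columns:
        problems.append("'columns' must be a non-empty list")
    elif len(set(columns)) != len(columns):
        problems.append("'columns' contains duplicate names")

    for key in ("inbound_delimiter", "outbound_delimiter"):
        value = config.get(key)
        if value is not None and len(str(value)) != 1:
            problems.append(f"'{key}' must be a single character")
    return problems


sample_config = {
    "file_type": "csv",
    "file_name": "russian_news",
    "inbound_delimiter": ",",
    "outbound_delimiter": "|",
    "columns": ["source", "title", "text", "date", "tags"],
}
print(check_config(sample_config))
print(check_config({"file_type": "csv"}))
```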

In [25]:
# Normal reading process of the file
# import pandas as pd
# df_sample = pd.read_csv("/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/birds.csv",delimiter=',')
# df_sample.head()

In [26]:
# Normal reading process of the file (Pandas)

start = time.time()

# Get all CSV files in the directory
csv_files = [f for f in os.listdir() if f.endswith('.csv')]

# Check if there are any CSV files and read the first one
if csv_files:
    csv_file_path = csv_files[0]

    # Get the file size
    file_size_gb = os.path.getsize(csv_file_path) / (1024 * 1024 * 1024)
    print(f"File size: {file_size_gb:.2f} GB")  # Size in GB

    try:
        # Read the entire CSV file
        df = pd.read_csv(csv_file_path, on_bad_lines='skip')
        print(df.head())  # Display the first few rows
        print("File read successfully.")
    except pd.errors.ParserError as e:
        print("ParserError encountered:", e)
    except Exception as e:
        print("An error occurred:", e)
else:
    print("No CSV file found.")

print(time.time() - start, 'seconds')

File size: 1.99 GB
     source                                           title  \
0  lenta.ru                                  Синий богатырь   
1  lenta.ru  Загитова согласилась вести «Ледниковый период»   
2  lenta.ru       Объяснена опасность однообразного питания   
3  lenta.ru                      «Предохраняться? А зачем?»   
4  lenta.ru     Ефремов систематически употреблял наркотики   

                                                text  \
0  В 1930-е годы Советский Союз охватила лихорадк...   
1  Олимпийская чемпионка по фигурному катанию  Ал...   
2  Российский врач-диетолог Римма Мойсенко объясн...   
3  В 2019 году телеканал «Ю» запустил адаптацию з...   
4  Актер  Михаил Ефремов  систематически употребл...   

                        date tags  
0  2020-08-29 21:01:00+00:00  Все  
1  2020-08-31 17:04:00+00:00  Все  
2  2020-08-31 17:07:00+00:00  Все  
3  2020-08-29 21:04:00+00:00  Все  
4  2020-08-31 15:27:00+00:00  Все  
File read successfully.
64.17567181587219 seconds
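
Each reader in this notebook is timed with the same `start = time.time()` pattern. If the timings are to be compared side by side, one option is to collect them in a dictionary through a small context manager. This is a sketch only; the `timed` helper and the `timings` dictionary are hypothetical names, and the summed range stands in for a real read.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the notebook this would be one of the read_csv calls.
with timed("sum"):
    total = sum(range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} seconds")
```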

In [27]:
current_directory = os.getcwd()
print("current directory:", current_directory)

current directory: /content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6


In [28]:
# Sets the directory path to direct to the specified location
target_directory = "/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/"

# Change the current directory to the specified location
os.chdir(target_directory)

# Get and display the current directory after change
current_directory = os.getcwd()
print("current directory:", current_directory)

current directory: /content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6


In [29]:
# read the file using config file
file_type = config_data['file_type']
source_file = "./" + config_data['file_name'] + f'.{file_type}'
print("",source_file)

#df = pd.read_csv(source_file,config_data['inbound_delimiter'])
df = pd.read_csv(source_file, delimiter=config_data['inbound_delimiter'])

df.head()

 ./russian_news.csv


Unnamed: 0,source,title,text,date,tags
0,lenta.ru,Синий богатырь,В 1930-е годы Советский Союз охватила лихорадк...,2020-08-29 21:01:00+00:00,Все
1,lenta.ru,Загитова согласилась вести «Ледниковый период»,Олимпийская чемпионка по фигурному катанию Ал...,2020-08-31 17:04:00+00:00,Все
2,lenta.ru,Объяснена опасность однообразного питания,Российский врач-диетолог Римма Мойсенко объясн...,2020-08-31 17:07:00+00:00,Все
3,lenta.ru,«Предохраняться? А зачем?»,В 2019 году телеканал «Ю» запустил адаптацию з...,2020-08-29 21:04:00+00:00,Все
4,lenta.ru,Ефремов систематически употреблял наркотики,Актер Михаил Ефремов систематически употребл...,2020-08-31 15:27:00+00:00,Все


##Validate number of columns and column name of ingested file with YAML.

In [30]:
#validate the header of the file
util.col_header_val(df,config_data)

column name and column length validation passed


1

In [31]:
print("columns of files are:" ,df.columns)
print("columns of YAML are:" ,config_data['columns'])

columns of files are: Index(['source', 'title', 'text', 'date', 'tags'], dtype='object')
columns of YAML are: ['source', 'title', 'text', 'date', 'tags']


In [32]:
if util.col_header_val(df,config_data)==0:
    print("validation failed")

else:
    print("col validation passed")

column name and column length validation passed
col validation passed


In [33]:
df.head()

Unnamed: 0,source,title,text,date,tags
0,lenta.ru,Синий богатырь,В 1930-е годы Советский Союз охватила лихорадк...,2020-08-29 21:01:00+00:00,Все
1,lenta.ru,Загитова согласилась вести «Ледниковый период»,Олимпийская чемпионка по фигурному катанию Ал...,2020-08-31 17:04:00+00:00,Все
2,lenta.ru,Объяснена опасность однообразного питания,Российский врач-диетолог Римма Мойсенко объясн...,2020-08-31 17:07:00+00:00,Все
3,lenta.ru,«Предохраняться? А зачем?»,В 2019 году телеканал «Ю» запустил адаптацию з...,2020-08-29 21:04:00+00:00,Все
4,lenta.ru,Ефремов систематически употреблял наркотики,Актер Михаил Ефремов систематически употребл...,2020-08-31 15:27:00+00:00,Все


##Write the file as a pipe-separated (|) text file in gz format.

In [34]:
output_news_path = '/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/output_file.csv.gz'

# Use the outbound delimiter from the YAML config ('|') and gzip compression
df.to_csv(output_news_path, sep=config_data['outbound_delimiter'], index=False, compression='gzip')

print(f"File '{output_news_path}' has been written in gzipped pipe-separated format.")

File '/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/output_file.csv.gz' has been written in gzipped pipe-separated format.


In [35]:
df_gz = pd.read_csv(output_news_path, sep='|', nrows=1000)  # Load first 1000 rows to check
df_gz

Unnamed: 0,source,title,text,date,tags
0,lenta.ru,Синий богатырь,В 1930-е годы Советский Союз охватила лихорадк...,2020-08-29 21:01:00+00:00,Все
1,lenta.ru,Загитова согласилась вести «Ледниковый период»,Олимпийская чемпионка по фигурному катанию Ал...,2020-08-31 17:04:00+00:00,Все
2,lenta.ru,Объяснена опасность однообразного питания,Российский врач-диетолог Римма Мойсенко объясн...,2020-08-31 17:07:00+00:00,Все
3,lenta.ru,«Предохраняться? А зачем?»,В 2019 году телеканал «Ю» запустил адаптацию з...,2020-08-29 21:04:00+00:00,Все
4,lenta.ru,Ефремов систематически употреблял наркотики,Актер Михаил Ефремов систематически употребл...,2020-08-31 15:27:00+00:00,Все
...,...,...,...,...,...
995,lenta.ru,Раскрыты подробности нахождения Чикатило в тюрьме,"Серийный убийца Андрей Чикатило , находясь в ...",2020-07-07 17:41:00+00:00,Все
996,lenta.ru,Наблюдатели от ОБСЕ пропустят выборы в Белорус...,Наблюдатели от Парламентской ассамблеи ОБСЕ ...,2020-07-07 17:30:00+00:00,Все
997,lenta.ru,Музеи поборолись за звание обладателя экспонат...,Музеи мира поборолись за звание обладателя экс...,2020-07-07 17:21:00+00:00,Все
998,lenta.ru,Диетолог рассказал о еде перед сном без вреда ...,"Доктор медицинских наук, директор Самарского Н...",2020-07-07 17:17:00+00:00,Все
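
Switching the separator to `|` is only safe if values that themselves contain `|` or line breaks are quoted on the way out and unquoted on the way in. pandas' `to_csv` uses minimal quoting by default, as does the standard-library `csv` writer. The round trip below illustrates this on made-up rows with the standard library only, writing to a temporary file rather than to the Drive path used above.

```python
import csv
import gzip
import os
import tempfile

# Made-up rows; the last text value contains both the delimiter and a line break.
rows = [
    ["source", "title", "text"],
    ["lenta.ru", "t1", "plain text"],
    ["lenta.ru", "t2", "text with | pipe and\na line break"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample.csv.gz")

    # newline="" lets the csv module control line endings itself.
    with gzip.open(out_path, "wt", encoding="utf-8", newline="") as f_out:
        csv.writer(f_out, delimiter="|").writerows(rows)

    with gzip.open(out_path, "rt", encoding="utf-8", newline="") as f_in:
        round_trip = list(csv.reader(f_in, delimiter="|"))

print(round_trip == rows)
```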


##Summary of the file

In [None]:
import os
import gzip
import csv

file_path = '/content/drive/MyDrive/Data_Glacier_Data_Analyst_2024/week6/output_file.csv.gz'
file_size = os.path.getsize(file_path)

# Article texts can be long, so raise the csv module's per-field size limit
csv.field_size_limit(10**9)

row_count = 0
column_count = 0

# Stream the gzipped file instead of loading it into memory at once
with gzip.open(file_path, 'rt', encoding='utf-8', newline='') as fOpen:
    # The file was written with '|' as the separator
    reader = csv.reader(fOpen, delimiter='|')

    header = next(reader, None)
    if header:
        column_count = len(header)

    # Count logical records (a quoted multi-line text counts as one row)
    row_count = sum(1 for _ in reader)

print("total number of rows:", row_count)
print("total number of columns:", column_count)
print("file size:", file_size, "bytes")