<a href="https://colab.research.google.com/github/peterlulu666/Data-Analytics-Using-Python/blob/main/Data_Analytics_Using_Python_Week_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QUIZ

1. Working through each department one-by-one, taking data from each of their legacy systems (files, databases, logs), and putting it into a business-wide reporting or analytics platform (eg, a data lake, data warehouse, or some standardised repository format).
  - Transforming, loading real-time data
  - **Extracting, loading batch data**
  - Extracting, transforming, loading batch data

2. Capturing a constant stream of marketing data (customer interactions, tweets, pins, posts, likes / dislikes) into a marketing analytics platform to perform real-time sentiment analysis and maximise the campaign effectiveness.
  - **Extracting, transforming, loading real-time data**
  - Transforming, loading real-time data
  - Extracting, transforming, loading batch data

3. Transferring data continuously from disparate systems into a data warehouse.
  - Extracting, transforming real-time data
  - Transforming, loading batch data
  - **Extracting, loading real-time data**





# Loading and Reading data

## Data loading and data formats

As you learned earlier, the amount of data residing in various sources is vast, and more is constantly being generated on our devices and in the cloud. All of this data needs to be accessed at some point, and that starts with data ingestion.

Data ingestion is the process of reading and loading data into Python from various underlying data sources so that it can then be processed and transformed as the application requires. Each kind of data source has its own protocol for transferring data, and as an analyst you must understand the differences among them. Most of the time, the loaded data is available in one of the following formats:

- Text data (CSV, JSON, Excel, etc.)
- Web data (HTML, XML)
- Databases (SQL and NoSQL data)
- Binary data formats

In this module, we will go into the details of some of the functions related to data ingestion for the CSV, JSON, HTML, and SQL data formats. We will also provide references to detailed documentation for other data types and different variations of data ingestion possible.

## Reading and writing text data

Due to its simple syntax for interacting with files, intuitive data structures, and convenient features like tuple packing and unpacking, Python has become the go-to language for text data. Pandas has several functions for reading tabular data into a DataFrame object. Here are some of these functions:

- read_csv(): Load delimited data from a file, URL, or file-like object. The comma ',' is the default delimiter.
- read_table(): Load delimited data from a file, URL, or file-like object. The tab '\t' is the default delimiter.
- read_fwf(): Read data in fixed-width column format, where there is no delimiter.

In this course, you will most commonly use the read_csv() and read_table() functions. For the full list of I/O (input/output) functions available in Pandas, you can refer to the following link: [Go to: Pandas IO Tools Document](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) [1]
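As a minimal sketch of how the three readers differ, the example below parses the same small table from three in-memory strings. The sample text, the variable names, and the `widths` values are illustrative assumptions, not files from the course dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory samples standing in for files on disk.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
fwf_text = "a   b   c\n1   2   3\n4   5   6\n"

# read_csv: the comma is the default delimiter.
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table: the tab is the default delimiter.
df_tsv = pd.read_table(io.StringIO(tsv_text))

# read_fwf: no delimiter; here the column widths are given explicitly.
df_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[4, 4, 1])

print(df_csv)
print(df_tsv)
print(df_fwf)
```

All three calls accept a path, a URL, or a file-like object; `io.StringIO` is used here only so the sketch runs without any files.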

## CSV functions

A CSV (comma-separated values) file is a type of plain text file. Each line is one record, and the commas separate the values so that the columns (fields) can be identified.

For example:

    ColumnA_header, ColumnB_header, ColumnC_header
    ColumnA_value1, ColumnB_value1, ColumnC_value1
    ColumnA_value2, ColumnB_value2, ColumnC_value2

Let us explore the details of the discussed CSV functions used to convert the text data into a DataFrame. If you think about it, you may realise that these functions should have parameters related to the following functionalities and features related to DataFrame:

### Indexing options

Column names can be read from the file, supplied by the user, or omitted, in which case the default column naming convention is used.

Row labels can be taken from a particular data column in the file, or the default row indexing can be used.

For example: Naming individual columns and rows with unique names so that they can be identified when called.

### Type inferences and data conversion options

Value conversion options as defined by the user.

Custom input data to deal with missing value markers.

For example: Converting a particular value in a cell of the DataFrame; this also includes replacing it with a new value or substituting the default NaN for missing values.
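The conversion options described above could look roughly like the following. This is a sketch under assumptions: the column names, the sample values, and the 'unknown' marker are invented for illustration; `dtype`, `converters`, and `na_values` are parameters of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical sample: ids with leading zeros, a custom missing marker,
# and inconsistently formatted status text.
text = "id,price,status\n001,10.5,ok\n002,unknown,OK \n003,7,missing\n"

df = pd.read_csv(
    io.StringIO(text),
    dtype={"id": str},                                   # keep ids as text so leading zeros survive
    converters={"status": lambda s: s.strip().lower()},  # user-defined value conversion
    na_values={"price": ["unknown"]},                    # treat 'unknown' as NaN in the price column
)

print(df)
print(df.dtypes)
```

Without `dtype={"id": str}`, the ids would be inferred as integers and the leading zeros would be lost.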

### DateTime parsing-related options

Capability to combine the date and time information spread across multiple columns in the input data and merging the combined data into a single column in the result.

For example: Combining month, day, and year to produce a full date.
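A hedged sketch of date handling follows. The first part lets `read_csv` parse a single date column with `parse_dates`; the second combines separate year, month, and day columns after reading with `pd.to_datetime`, because the options for combining columns inside `read_csv` itself vary between pandas versions. The sample data and column names are assumptions.

```python
import io
import pandas as pd

# Case 1: one column already holds a full date string.
single = "date,sales\n2021-05-12,100\n2021-05-13,120\n"
df_single = pd.read_csv(io.StringIO(single), parse_dates=["date"])

# Case 2: the date is spread across three columns.
split = "year,month,day,sales\n2021,5,12,100\n2021,5,13,120\n"
df_split = pd.read_csv(io.StringIO(split))

# to_datetime assembles a date from columns named year, month and day.
df_split["date"] = pd.to_datetime(df_split[["year", "month", "day"]])

print(df_single.dtypes)
print(df_split)
```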

### Iteration-related options

Iterate over smaller chunks of data in the case of large files.

For example: Reading a very large file a fixed number of rows at a time and aggregating the results chunk by chunk.

### Data Issue-related options

Options to deal with data nuances

For example: Skipping rows or a footer, comments, or others such as numeric data with commas used in the representation
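The sketch below illustrates two of these options, `comment` and `thousands`, on an invented sample; the city names and figures are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample: a comment line and numbers written with thousands separators.
text = (
    "city,population\n"
    "# interim figures\n"
    'Springfield,"1,234,567"\n'
    'Shelbyville,"89,000"\n'
)

df = pd.read_csv(
    io.StringIO(text),
    comment="#",     # ignore everything from '#' to the end of the line
    thousands=",",   # read "1,234,567" as the number 1234567
)

print(df)
```

The numbers are quoted in the sample so that the commas inside them are not mistaken for field delimiters.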

# Evaluating CSV functions

Now, let’s evaluate these functions and their options. Several sample CSV files are stored in the module directory (if you do not have one, create a directory to hold the files you downloaded under datasets at the start of this topic). We will use these files to perform read operations and numerous variations on them.

You can see that there are comma-separated values in the file. We’ll show you how to use the function read_csv() and also check the various parameters available for this function.

## CSV’s read operation

For the following tasks, you first need to import both the NumPy and Pandas libraries. Code:

In [1]:
import pandas as pd
import numpy as np

Make sure you have the file named sample_data.txt downloaded on your machine. The first cell below downloads and unpacks the dataset archive (in Colab); the cell after it demonstrates the use of the read_csv function. Code:

In [2]:
!wget https://github.com/peterlulu666/Data-Analytics-Using-Python/raw/main/dataset.zip
!unzip dataset.zip
%cd dataset/

--2021-05-12 00:42:54--  https://github.com/peterlulu666/Data-Analytics-Using-Python/raw/main/dataset.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/peterlulu666/Data-Analytics-Using-Python/main/dataset.zip [following]
--2021-05-12 00:42:54--  https://raw.githubusercontent.com/peterlulu666/Data-Analytics-Using-Python/main/dataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14938326 (14M) [application/zip]
Saving to: ‘dataset.zip’


2021-05-12 00:42:54 (39.0 MB/s) - ‘dataset.zip’ saved [14938326/14938326]

Archive:  dataset.zip
  inflating: dataset/file_json_split.json  
  inflating:

In [3]:
filename = "/content/dataset/sample_data.txt"
df= pd.read_csv(filename)
df

Unnamed: 0,a,b,c,d,comments
0,1,2,3,4,comment1
1,5,6,7,8,comment2
2,9,10,11,Adam,comment3
3,12,13,14,15,comment4
4,I1,16,17,18,comment5


Based on these results, the following can be observed:

- The first row of the file has been considered as the name of the columns by default.
- The default row labels (row indexes) have been considered (look at the values 0,1,2,3,4).
- Parsing of the values has happened automatically (i.e., we didn’t have to specify the ’,’ as the delimiter explicitly).
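To see what the automatic parsing inferred for each column, one option is to inspect `df.dtypes`. The sketch below rebuilds the table from the output shown above as an in-memory string (an assumption about the file contents) and shows that a single non-numeric value, such as 'Adam' or 'I1', turns a whole column into text.

```python
import io
import pandas as pd

# Assumed contents of sample_data.txt, reconstructed from the output above.
text = (
    "a,b,c,d,comments\n"
    "1,2,3,4,comment1\n"
    "5,6,7,8,comment2\n"
    "9,10,11,Adam,comment3\n"
    "12,13,14,15,comment4\n"
    "I1,16,17,18,comment5\n"
)

df = pd.read_csv(io.StringIO(text))

# Columns b and c are numeric; a and d contain text and are not.
print(df.dtypes)

# Convert column d to numbers, turning unparseable values into NaN.
numeric_d = pd.to_numeric(df["d"], errors="coerce")
print(numeric_d)
```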

## Specifying the column names

Tables in Excel generally come with a header row that either identifies the content of each column or gives the number of the column. There are scenarios where you have to state explicitly whether a header row exists. The header parameter of the function controls this behaviour: if you pass header=None, the first row is treated as a data record rather than as the header, and the column names for the DataFrame are generated automatically.

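Make sure you have the file named sample_data_noheader.txt downloaded on your machine.

Next, we will read a file and instruct the read_csv function to treat the first row as a data record. Note that the cell below reads sample_data.txt, which does have a header line, so that line (a, b, c, d, comments) appears as data row 0 in the output.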

Code:

In [4]:
filename = "/content/dataset/sample_data.txt"
df= pd.read_csv(filename, header=None)
df

Unnamed: 0,0,1,2,3,4
0,a,b,c,d,comments
1,1,2,3,4,comment1
2,5,6,7,8,comment2
3,9,10,11,Adam,comment3
4,12,13,14,15,comment4
5,I1,16,17,18,comment5


In this particular scenario, there could be a requirement to explicitly provide the column names instead of relying on the auto-indexing for column names.

In that case, we pass the list of column names to the read_csv() function through the names parameter. When we pass names=[list of column names], we do not also have to pass header=None. The code snippet demonstrates this:

Code:

In [5]:
# We provide the header name
names=['a','b','c','d','comments']
df=pd.read_csv(filename, names=names)
df

Unnamed: 0,a,b,c,d,comments
0,a,b,c,d,comments
1,1,2,3,4,comment1
2,5,6,7,8,comment2
3,9,10,11,Adam,comment3
4,12,13,14,15,comment4
5,I1,16,17,18,comment5
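In the output above, the file's own header line ends up as data row 0 because `names` was supplied for a file that already has a header. If the intent is to replace the existing header instead, pandas lets you pass `header=0` together with `names`. The sample text and the replacement names below are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical file content that already contains a header line.
text = "a,b,c,d,comments\n1,2,3,4,comment1\n5,6,7,8,comment2\n"
new_names = ["w", "x", "y", "z", "note"]

# names only: the original header line is kept as the first data row.
kept_as_data = pd.read_csv(io.StringIO(text), names=new_names)

# names plus header=0: the original header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(text), names=new_names, header=0)

print(kept_as_data)
print(replaced)
```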


## Specifying the row labels / row index from a column in the data file

Let us say your rows describe individual items such as mango, apple, and grape, and one column, named fruits, holds those names. You might then decide to use that column to label the rows. Pandas comes in handy in such scenarios, where you would like to specify the row labels (row indexes) explicitly instead of using the default indexes.

In the current example, assume that we want the comments column to supply the row labels of the DataFrame. This can be achieved by using the index_col parameter of the function and specifying the name of the column to be used as the row label. The code snippets demonstrate this.

Code:

In [6]:
# We can use the comments as the row label name
df=pd.read_csv(filename, names=names, index_col='comments')
df

Unnamed: 0_level_0,a,b,c,d
comments,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
comments,a,b,c,d
comment1,1,2,3,4
comment2,5,6,7,8
comment3,9,10,11,Adam
comment4,12,13,14,15
comment5,I1,16,17,18


In [7]:
df.loc['comment1']

a    1
b    2
c    3
d    4
Name: comment1, dtype: object

Based on these results, the following can be observed:

- the comments column from the input data file has now been used to specify the row labels of the DataFrame

- you can access a row of the DataFrame by its label, for example using the key 'comment1'.

## Hierarchical indexing

Let us say you want to include hierarchical indexing. Instead of one column considered as the index (such as fruits in the previous example), you have a hierarchy of columns considered as indexes (such as the fruit varieties Tropical and Exotic). The inner index can repeat the same values, such as mango, apple, and grape, but each value is now specific to its outer index: a tropical mango, apple, or grape, or an exotic mango, apple, or grape.

Make sure you have the file named sample_data_hierarchy.txt downloaded on your machine.

In this particular case, the first two columns together can be considered as the row index. For such a scenario, we would have to pass the list of columns to be considered as a hierarchical index to the read_csv function, using the index_col parameter. The code snippets demonstrate using the index_col parameter.

Code:

In [8]:
filename = "/content/dataset/sample_data_hierarchy.txt"
names=['I1','I2','col1','col2','col3','col4','comments']
df= pd.read_csv(filename, names=names, index_col=['I1','I2'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,col1,col2,col3,col4,comments
I1,I2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,1,0,1,2,3,comment1
A,2,1,2,3,4,comment2
A,3,5,6,7,8,comment3
A,4,9,10,11,Adam,comment4
B,1,12,13,14,15,comment5
B,2,I1,16,17,18,comment6
B,3,I2,19,20,21,comment7


In [9]:
df.loc['A',1]

col1               0
col2               1
col3               2
col4               3
comments    comment1
Name: (A, 1), dtype: object

Based on these results, the following can be observed:

- we have read the data file and specified the column names by passing the list of column names to the parameter names
- we have also specified the hierarchical indexing to be used by passing the list of column names to be considered as indexes to the parameter index_col
- we are accessing the first row of the DataFrame using the hierarchical index ‘A’,1.
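Beyond a single row, a hierarchical index also lets you select whole groups. The following sketch uses a shortened, in-memory version of the table (an assumption, with fewer columns than the real file) to show three common selections.

```python
import io
import pandas as pd

# Shortened stand-in for sample_data_hierarchy.txt (illustrative values).
text = "A,1,0,1\nA,2,1,2\nB,1,12,13\n"
names = ["I1", "I2", "col1", "col2"]
df = pd.read_csv(io.StringIO(text), names=names, index_col=["I1", "I2"])

# One row, addressed by the full (outer, inner) key: returns a Series.
row = df.loc[("A", 1)]

# All rows under the outer key 'A': returns a DataFrame indexed by I2.
group_a = df.loc["A"]

# All rows whose inner key I2 equals 1, across every outer key.
inner_1 = df.xs(1, level="I2")

print(row)
print(group_a)
print(inner_1)
```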

# Specific functions

## Specifying the separator / delimiter explicitly

You were introduced to csv as comma separated values files, where the default delimiter was a comma (,). But, what if that default delimiter would not work? In such cases, you may have to specify the separator / delimiter explicitly when reading the data from the file.

For example: If white space is used as the separator instead of the default comma ',', you need to state the delimiter explicitly in the call.

In this case, you can specify the separator by using the parameter sep and passing the separator value. Consider the file named sample_data_noheader_space.txt (make sure you have this downloaded on your machine), where white space is used as the separator / delimiter. To read this file correctly into the DataFrame, we will use the parameter sep.

The first code snippet shows what happens when we do not use the sep parameter: each whole line is read into the first column of the DataFrame.

Code:

In [10]:
filename = "/content/dataset/sample_data_noheader_space.txt"
names=['a','b','c','d','comments']
df= pd.read_csv(filename, header=None, names=names)
df

Unnamed: 0,a,b,c,d,comments
0,1 2 3 4 comment1,,,,
1,5 6 7 8 comment2,,,,
2,9 10 11 Adam comment3,,,,
3,12 13 14 15 comment4,,,,
4,I1 16 17 18 comment5,,,,


As you can see, when we do not use the parameter sep, all the values of a line are read into the first column of the DataFrame.

Now, let’s specify the sep parameter and read the data correctly. 

Code:

In [11]:
# the delimiter is the space
df= pd.read_csv(filename, header=None, names=names, sep=" ")
df

Unnamed: 0,a,b,c,d,comments
0,1,2,3,4,comment1
1,5,6,7,8,comment2
2,9,10,11,Adam,comment3
3,12,13,14,15,comment4
4,I1,16,17,18,comment5


You can now see the data has been read into the DataFrame correctly.
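Passing sep=" " treats every single space as a delimiter, so it only works when exactly one space separates the values. When the amount of white space varies, a regular expression such as `r"\s+"` (one or more white-space characters) can be passed instead. The sample text below is an assumption made up to show the difference in spacing.

```python
import io
import pandas as pd

# Hypothetical sample with an uneven number of spaces between values.
text = "1  2   3 4 comment1\n5 6 7    8 comment2\n"
names = ["a", "b", "c", "d", "comments"]

# r"\s+" matches any run of spaces or tabs as a single delimiter.
df = pd.read_csv(io.StringIO(text), header=None, names=names, sep=r"\s+")

print(df)
```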

## Skipping the specified rows while reading data from a file

There could be scenarios where you need to ignore specific rows while reading the data from a file.

For example:

- when there are comment lines at the top of the file
- when there are comment lines at fixed line numbers in the file
- when there is a footer line in the text file.

In this case, we will use the skiprows parameter of the read_csv function. To skip specific rows, pass a list of zero-based row indexes to this parameter; passing a single integer n instead skips the first n lines of the file.

For this task, consider the file with this data. Make sure you have the file named sample_data_comments.txt downloaded on your machine.

In this particular case, we will ignore the first, second, third and sixth rows by passing the indexes 0, 1, 2 and 5 respectively. The code snippet below demonstrates this.

Code:

In [12]:
filename = "/content/dataset/sample_data_comments.txt"
names=['a','b','c','d','comments']
df= pd.read_csv(filename, header=None, names=names)
df

Unnamed: 0,a,b,c,d,comments
0,#This is the Row 1 to be ignored,,,,
1,#This is the Row 2 to be ignored,,,,
2,#This is the Comment Row 3 to be ignored,,,,
3,a,b,c,d,comments
4,1,2,3,4,comment1
5,#This is the 6th row in this file,to be ignored,,,
6,5,6,7,8,comment2
7,9,10,11,Adam,comment3
8,12,13,14,15,comment4
9,I1,16,17,18,comment5


As you can see, if we do not specify the skiprows parameter, the DataFrame is not parsed appropriately. The comment rows have been read as data, and depending on whether a comment contains a comma, its text has been parsed into the first column only or spread over more columns.

We will now perform this read operation by correctly specifying the skiprows parameter. 

Code:

In [13]:
# skip the comment lines at zero-based line indexes 0, 1, 2 and 5
df= pd.read_csv(filename, header=None, names=names, skiprows=[0,1,2,5])
df

Unnamed: 0,a,b,c,d,comments
0,a,b,c,d,comments
1,1,2,3,4,comment1
2,5,6,7,8,comment2
3,9,10,11,Adam,comment3
4,12,13,14,15,comment4
5,I1,16,17,18,comment5


You can see the DataFrame has now been correctly populated.
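Listing line numbers works when the positions of the comment lines are known in advance. Two alternatives are sketched here on an invented sample: the `comment` parameter, which ignores lines starting with a marker character wherever they occur, and a callable passed to `skiprows`, which is evaluated on each zero-based line index.

```python
import io
import pandas as pd

# Hypothetical sample modelled on a file with '#' comment lines.
text = (
    "#ignored line 1\n"
    "#ignored line 2\n"
    "a,b,c,d,comments\n"
    "1,2,3,4,comment1\n"
    "#ignored line in the middle\n"
    "5,6,7,8,comment2\n"
)

# Option 1: ignore every line that starts with '#', wherever it appears.
by_comment = pd.read_csv(io.StringIO(text), comment="#")

# Option 2: skip lines whose zero-based index satisfies a condition.
by_callable = pd.read_csv(io.StringIO(text), skiprows=lambda i: i in {0, 1, 4})

print(by_comment)
print(by_callable)
```

Note that `comment="#"` also cuts off the rest of any data line that contains a '#', so it is only suitable when that character never appears inside real values.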

## Skipping the footer while reading a text file

Similarly, a text file may contain a footer that is irrelevant to your analysis and that you want to ignore while reading the file. In such cases, use the parameter skipfooter and pass an integer equal to the number of lines to ignore at the end of the file. The default C parser does not support skipfooter, so pandas falls back to the Python parser with a warning unless you also pass engine='python'.

Make sure you have the file named sample_data_footer.txt downloaded on your machine.

Code:

In [14]:
# skip the last line of the file (the footer)
filename = "/content/dataset/sample_data_footer.txt"
names=['a','b','c','d','comments']
# skipfooter is only supported by the Python parsing engine
df= pd.read_csv(filename, header=None, names=names, skipfooter=1, engine='python')
df


Unnamed: 0,a,b,c,d,comments
0,a,b,c,d,comments
1,1,2,3,4,comment1
2,5,6,7,8,comment2
3,9,10,11,Adam,comment3
4,12,13,14,15,comment4
5,I1,16,17,18,comment5


# Handling missing data

## Handling missing data while reading data from a file

In the examples shown, you may have observed that when data is missing, the default behaviour is to replace the missing value with NaN. Handling missing values is an integral part of the file-parsing process, with some subtle nuances:

- Explicitly specifying the input values to be considered as missing values while reading the file. These are scenarios when a specific naming convention may denote the missing values in the input file. For example:

  - Some data might have ‘Missing’ to specify the missing data values
  - Some might even have ‘99999999’ to specify missing reading in case of numeric values
  - Some might even have a combination of both ‘Missing’ and ‘99999999’ for different data columns in the file

In the next example file, some rows have missing values: look for an empty field between two commas. Some rows also mark a missing value explicitly as ‘Missing’ (the marker could be anything, e.g. NA instead of Missing).

Make sure you have the file named sample_data_missingvalues.txt downloaded on your machine.

The code snippet demonstrates reading this file first without explicitly specifying that ‘Missing’ be considered as a missing value.

Code:

In [15]:
filename = "/content/dataset/sample_data_missingvalues.txt"
names=['a','b','c','d','comments','Value']
df= pd.read_csv(filename, header=None, names=names)
df

Unnamed: 0,a,b,c,d,comments,Value
0,1,2,,4,comment1,1
1,5,,7,8,comment2,2
2,9,foo,foo,Adam,,3
3,12,foo,14,15,comment4,999999
4,I1,16,17,18,comment5,999999
5,I2,Missing,18,19,Missing,4
6,I3,Missing,Missing,20,comments added,-1


Now, let’s read the same file by specifying the missing values using the na_values parameter.

Code:

In [16]:
# replace the Missing with the NaN
df= pd.read_csv(filename, header=None, names=names, na_values="Missing")
df

Unnamed: 0,a,b,c,d,comments,Value
0,1,2,,4,comment1,1
1,5,,7,8,comment2,2
2,9,foo,foo,Adam,,3
3,12,foo,14,15,comment4,999999
4,I1,16,17,18,comment5,999999
5,I2,,18,19,,4
6,I3,,,20,comments added,-1


Based on these results, the following can be observed:

- Row index 5, columns b and comments – the string ‘Missing’ has been read as the missing value NaN.
- Row index 6, columns b and c – the string ‘Missing’ has been read as the missing value NaN.

We can extend this scenario and set the following requirements:

- Columns a,b,c,d and comments – string ‘Missing’ has to be treated as missing values.
- Column ‘Value’ – value 999999 has to be treated as a missing value.

This can be achieved by passing the dictionary object to na_values where:

- key – the name of columns.
- value – the list of values to be considered as the missing value for those columns.

The code snippets demonstrate how:

- The values ‘foo’ and ‘Missing’ will be considered as missing values for columns a,b,c,d and comments.
- The values -1 and 999999 will be considered as the missing value for the column Value.

Code:

In [17]:
# treat 'foo' and 'Missing' as NaN in columns a, b, c, d and comments
# treat 999999 and -1 as NaN in column Value
dict_missing = {
	'a':['foo','Missing'],
	'b':['foo','Missing'],
	'c':['foo','Missing'],
	'd':['foo','Missing'],
	'comments':['foo','Missing'],
	'Value':[999999, -1]
}

df= pd.read_csv(filename, header=None, names=names, na_values=dict_missing)
df

Unnamed: 0,a,b,c,d,comments,Value
0,1,2.0,,4,comment1,1.0
1,5,,7.0,8,comment2,2.0
2,9,,,Adam,,3.0
3,12,,14.0,15,comment4,
4,I1,16.0,17.0,18,comment5,
5,I2,,18.0,19,,4.0
6,I3,,,20,comments added,
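After reading with custom missing-value markers, it is often useful to count how many values per column ended up as NaN. The sketch below does this on a small invented sample that reuses the markers from the example above; the column subset and values are assumptions.

```python
import io
import pandas as pd

# Hypothetical sample with an empty field and several custom markers.
text = "a,b,Value\n1,foo,1\n2,Missing,999999\n3,,-1\n"

na_markers = {
    "b": ["foo", "Missing"],   # text markers for column b
    "Value": [999999, -1],     # numeric markers for column Value
}

df = pd.read_csv(io.StringIO(text), na_values=na_markers)

# isna() flags missing cells; sum() counts them per column.
missing_per_column = df.isna().sum()

print(df)
print(missing_per_column)
```

The empty field in column b is counted as well, because the default missing-value markers still apply alongside the custom ones.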


# Reading the text files in pieces

When processing huge files, you may want to:

- only read a small portion of the file to figure out the right set of arguments to be used
- read the file in smaller chunks and iterate over it sequentially.

There are two parameters we can use in this scenario:

- nrows – specifies the number of rows to be read.
- chunksize – specifies the number of rows to be read as one chunk. In this case, read_csv returns an iterator (a TextFileReader object, referred to here as the File Iterator) instead of a DataFrame. Each iteration yields a DataFrame of up to chunksize rows, which we can use to build the resulting DataFrame or perform quantitative calculations.
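A small sketch of both parameters, using a ten-row in-memory table in place of a large file (the data and the chunk size of 4 are assumptions chosen to keep the example short):

```python
import io
import pandas as pd

# Build a hypothetical 10-row table in memory.
rows = "".join(f"R{i},{i}\n" for i in range(1, 11))
text = "Row Label,Values\n" + rows

# nrows: read only the first 3 data rows into a DataFrame.
preview = pd.read_csv(io.StringIO(text), nrows=3)

# chunksize: get an iterator that yields DataFrames of up to 4 rows each.
reader = pd.read_csv(io.StringIO(text), chunksize=4)
chunk_lengths = [len(chunk) for chunk in reader]

print(preview)
print(chunk_lengths)
```

The last chunk is shorter than the others whenever the number of rows is not a multiple of the chunk size.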

Make sure you have the file named sample_data_large.csv downloaded on your machine.

The code snippet demonstrates using nrows.

Code:

In [18]:
# read only the first 10 rows
filename = "/content/dataset/sample_data_large.csv"
df= pd.read_csv(filename, nrows=10)
df

Unnamed: 0,Row Label,Column 1,Column 2,Column 3,Column 4,Values
0,R1,78,1128,9336,6111,21069
1,R2,77,2881,8379,801,822749
2,R3,62,2966,6989,241,869213
3,R4,91,8032,4315,6801,198173
4,R5,0,6983,7195,7615,464909
5,R6,17,2322,1139,2828,438359
6,R7,95,7205,4518,3437,54932
7,R8,32,6033,392,1725,785029
8,R9,1,6710,2368,1858,41489
9,R10,25,8678,1519,3866,481310


As you can see, only the top 10 rows have been read in this particular case. The next cells read the full file to check its size and summary statistics; the use of chunksize and the File Iterator is demonstrated further below.

In [19]:
# number of lines in the file: data rows plus the header line
df= pd.read_csv(filename)
df.shape[0]+1

1048576

In [20]:
# limit the display to 10 rows
df= pd.read_csv(filename)
pd.set_option('display.max_rows', 10)
df

Unnamed: 0,Row Label,Column 1,Column 2,Column 3,Column 4,Values
0,R1,78,1128,9336,6111,21069
1,R2,77,2881,8379,801,822749
2,R3,62,2966,6989,241,869213
3,R4,91,8032,4315,6801,198173
4,R5,0,6983,7195,7615,464909
...,...,...,...,...,...,...
1048570,R1048571,9,5727,3920,3927,853689
1048571,R1048572,80,6758,9772,9576,519973
1048572,R1048573,95,7793,3635,5936,220089
1048573,R1048574,9,1984,5527,8149,823118


In [21]:
filename = "/content/dataset/sample_data_large.csv"
df= pd.read_csv(filename)
df

Unnamed: 0,Row Label,Column 1,Column 2,Column 3,Column 4,Values
0,R1,78,1128,9336,6111,21069
1,R2,77,2881,8379,801,822749
2,R3,62,2966,6989,241,869213
3,R4,91,8032,4315,6801,198173
4,R5,0,6983,7195,7615,464909
...,...,...,...,...,...,...
1048570,R1048571,9,5727,3920,3927,853689
1048571,R1048572,80,6758,9772,9576,519973
1048572,R1048573,95,7793,3635,5936,220089
1048573,R1048574,9,1984,5527,8149,823118


In [22]:
df.describe()

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Values
count,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0
mean,50.01918,5052.93,5052.788,5050.653,500244.5
std,29.14091,2858.973,2856.928,2857.718,288646.6
min,0.0,100.0,100.0,100.0,1.0
25%,25.0,2576.0,2578.0,2575.0,250370.0
50%,50.0,5054.0,5054.0,5054.0,500466.0
75%,75.0,7531.0,7525.0,7527.0,749893.5
max,100.0,10000.0,10000.0,10000.0,1000000.0


Now, look at the memory footprint of the DataFrame that you have just read.

Code:

In [23]:
print("The memory footprint of the DataFrame is: %f MB"%(df.memory_usage().sum()/(1024 * 1024)))

The memory footprint of the DataFrame is: 48.000076 MB


Assume the requirement is to calculate the sum of the values in the ‘Values’ column. This can be achieved without loading the whole file into a single DataFrame, by using the File Iterator and summing the column chunk by chunk.

This would have a significantly smaller memory footprint. We can access the File Iterator by:

- using the chunksize parameter and specifying an integer value
- specifying the iterator parameter and passing the boolean value True.

If you calculate this sum using a DataFrame as well as the iterator, conducting memory profiling for both operations, you can expect that:

- summation using the DataFrame will require a larger memory footprint but less execution time
- summation using the iterator will have a smaller memory footprint but more execution time, because the data is moved into memory in multiple steps for the calculation.

The code snippet demonstrates using Python’s memory and execution time profiling to check these values.

If you do not have the memory_profiler module, install it using the following command in your terminal/shell:

In [24]:
!pip3 install memory_profiler

Collecting memory_profiler
  Downloading https://files.pythonhosted.org/packages/8f/fd/d92b3295657f8837e0177e7b48b32d6651436f0293af42b76d134c3bb489/memory_profiler-0.58.0.tar.gz
Building wheels for collected packages: memory-profiler
  Building wheel for memory-profiler (setup.py) ... done
  Created wheel for memory-profiler: filename=memory_profiler-0.58.0-cp37-none-any.whl size=30180 sha256=d98b5b32c8b1a04b115d80c291d7184708c161fdfb0b2d15ec0e81cf5374ff78
  Stored in directory: /root/.cache/pip/wheels/02/e4/0b/aaab481fc5dd2a4ea59e78bc7231bb6aae7635ca7ee79f8ae5
Successfully built memory-profiler
Installing collected packages: memory-profiler
Successfully installed memory-profiler-0.58.0


In [25]:
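import time
import memory_profiler

## Memory and Time Profiling with the full DataFrame Operation
if __name__ == '__main__':
    # Record memory usage and CPU time before the operation
    m1 = memory_profiler.memory_usage()
    t1 = time.process_time()

    # Read the whole file into a DataFrame and sum the 'Values' column
    filename = "/content/dataset/sample_data_large.csv"
    df = pd.read_csv(filename)
    total = df['Values'].sum()
    print("Total Values: %d "%total)

    # Record memory usage and CPU time after the operation
    t2 = time.process_time()
    m2 = memory_profiler.memory_usage()
    time_diff = t2 - t1
    mem_diff = m2[0] - m1[0]
    print(f"It took {time_diff} Secs and {mem_diff} Mb to execute this method")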

Total Values: 524543837226 
It took 0.8312695569999997 Secs and 72.05859375 Mb to execute this method


In [None]:
## Memory and Time Profiling with the File Iterator Operation
if __name__ == '__main__':
    # Record memory usage and CPU time before the operation
    m1 = memory_profiler.memory_usage()
    t1 = time.process_time()

    # Read the file 1000 rows at a time and accumulate the sum of 'Values'
    filename = "/content/dataset/sample_data_large.csv"
    reader = pd.read_csv(filename, chunksize=1000)
    total_value = 0
    for chunk in reader:
        total_value += chunk['Values'].sum()
    print(total_value)

    # Record memory usage and CPU time after the operation
    t2 = time.process_time()
    m2 = memory_profiler.memory_usage()
    time_diff = t2 - t1
    mem_diff = m2[0] - m1[0]
    print(f"It took {time_diff} Secs and {mem_diff} Mb to execute this method")

The memory footprint is expected to be significantly smaller when using the File Iterator, because only one chunk of rows is held in memory at a time.
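To check that chunked summation gives the same answer as reading everything at once, the sketch below runs both approaches on a small in-memory table. The data, the chunk size, and the use of `usecols` to load only the needed column are assumptions for illustration; it also shows the `iterator=True` route with `get_chunk()`.

```python
import io
import pandas as pd

# Hypothetical 1000-row table standing in for a large file.
text = "Row Label,Values\n" + "".join(f"R{i},{i}\n" for i in range(1, 1001))

# Approach 1: read everything, then sum.
full_total = pd.read_csv(io.StringIO(text))["Values"].sum()

# Approach 2: read 100 rows at a time, loading only the 'Values' column.
chunk_total = 0
for chunk in pd.read_csv(io.StringIO(text), usecols=["Values"], chunksize=100):
    chunk_total += chunk["Values"].sum()

# iterator=True returns the same kind of reader; get_chunk(n) pulls n rows on demand.
reader = pd.read_csv(io.StringIO(text), iterator=True)
first_five = reader.get_chunk(5)

print(full_total, chunk_total)
print(first_five)
```

A larger chunk size generally means fewer iterations and more memory per chunk, so the value is a trade-off to tune for the file and machine at hand.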

For detailed information on the numerous parameters that can be specified for the read_csv() function, refer to the following link:

Go to: [Pandas 1.1.3 documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) [1]

Note: We have used read_csv() in our demonstration, but we can also use the read_table() function. The main difference between the two functions is the default delimiter / separator that each one uses.

- The read_csv function uses the comma ',' as the default delimiter.
- The read_table function uses the tab '\t' as the default delimiter.

# QUIZ

1. Select the Pandas read_csv() default delimiter:
  - '-' – the hyphen is the default delimiter.
  - '\t' – the tab is the default delimiter.
  - '\csv' – the tab is the default delimiter.
  - **',' – the comma is the default delimiter.**

2. Select the Pandas read_table() default delimiter:
  - '\csv' – the tab is the default delimiter.
  - '-' – the hyphen is the default delimiter.
  - **'\t' – the tab is the default delimiter.**
  - ',' – the comma is the default delimiter.

3. What parameter do you use when you have a requirement to explicitly provide the column names instead of relying on auto-indexing for column names?
  - **names=[list of column names]**
  - header=None
  - header=ColNames
  - names=list of column names



