Introduction

Use the pandas read_csv() function to read a CSV (comma-separated values) file into a pandas DataFrame. The function also supports options for reading any delimited file.
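
As a minimal sketch of reading a file that uses a delimiter other than a comma, the snippet below passes sep to read_csv(). The semicolon-separated data is hypothetical and is inlined with io.StringIO so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data, inlined so no file is needed.
raw = io.StringIO("Name;Age;Salary\nJames;36;5428000\nSmith;34;4428000\n")

# sep tells read_csv which character separates the fields (the default is ",").
df_semicolon = pd.read_csv(raw, sep=";")
print(df_semicolon)
```

The same call works for tab-separated data with sep="\t".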


Lab Overview

In this lab, we will demonstrate how to read a CSV file with or without a header, skip rows, skip columns, set columns as the index, handle missing data, and more, with examples. By the end of this lab, learners will be able to load and work with CSV files using pandas DataFrames.


In [4]:
import pandas as pd
import numpy as np
import json
df = pd.read_csv('employee.csv')
df

Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0
5,Adam,40.0,,4528000.0
6,Rooter,33.0,72.0,7028000.0
7,Peterson,42.0,85.0,2528000.0
8,lynda,42.0,85.0,
9,,42.0,85.0,
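
The overview mentions setting a column as the index. One way to do that at load time is the index_col parameter of read_csv(). The sketch below assumes a small hypothetical sample inlined with io.StringIO, not the lab's employee.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file.
csv_text = "Name,Age,Weight\nJames,36,75\nVillers,38,74\nVKole,31,70\n"

# index_col makes the Name column the row index instead of a regular column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# Rows can now be looked up by name with .loc.
print(df_indexed.loc["Villers"])
```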


Example 2: Viewing or Exploring Your Data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():


In [7]:
df.head()

Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0


.head() outputs the first five rows of your DataFrame by default, but you can also pass a number: df.head(10) outputs the first ten rows.

In [8]:
df.head(10)

Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0
5,Adam,40.0,,4528000.0
6,Rooter,33.0,72.0,7028000.0
7,Peterson,42.0,85.0,2528000.0
8,lynda,42.0,85.0,
9,,42.0,85.0,


To see the last five rows, use df.tail(). It also accepts a number; df.tail(2), for example, prints the bottom two rows.

In [9]:
df.tail(2)

Unnamed: 0,Name,Age,Weight,Salary
13,John,41.0,85.0,1528000.0
14,Ali,26.0,69.0,
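
To try .head() and .tail() without the lab file, here is a small self-contained sketch. The eight rows are generated, hypothetical data; only the row counts matter.

```python
import io

import pandas as pd

# Build a hypothetical 8-row CSV in memory.
rows = "\n".join(f"emp{i},{20 + i}" for i in range(8))
df_small = pd.read_csv(io.StringIO("Name,Age\n" + rows))

first_three = df_small.head(3)  # first 3 rows
last_two = df_small.tail(2)     # last 2 rows; the original index labels are kept

print(first_three)
print(last_two)
```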


.info() should be one of the very first commands you run after loading your data:

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    14 non-null     object 
 1   Age     12 non-null     float64
 2   Weight  14 non-null     float64
 3   Salary  12 non-null     float64
dtypes: float64(3), object(1)
memory usage: 608.0+ bytes


.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Another fast and useful attribute is .shape, which returns just a tuple of (rows, columns):

In [12]:
df.shape

(15, 4)

Note that .shape is an attribute, so it takes no parentheses, and it is a simple tuple of the form (rows, columns). So we have 15 rows and 4 columns in our employee DataFrame.
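
Because .shape is a plain tuple, it can be unpacked directly. The sketch below also counts missing values per column with isna().sum(), which complements the non-null counts that .info() reports. The three-row sample is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample with some missing fields (empty between commas).
csv_text = "Name,Age,Salary\nJames,36,5428000\nlynda,42,\n,42,\n"
df_sample = pd.read_csv(io.StringIO(csv_text))

# Unpack the (rows, columns) tuple.
n_rows, n_cols = df_sample.shape
print(f"{n_rows} rows, {n_cols} columns")

# Count missing values in each column.
missing_per_column = df_sample.isna().sum()
print(missing_per_column)
```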

Example 4: Skip Rows

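Sometimes you may need to skip the first rows or the footer rows of a file. Use the skiprows and skipfooter parameters, respectively. In the example below, skiprows=5 skips the first five lines of the file (the header line and the first four data rows), and header=None tells pandas to number the columns instead of reading names from the file.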


In [13]:
# Skip first few rows
df = pd.read_csv('employee.csv', header=None, skiprows=5)
print(df)

           0     1      2          3
0      Gayle  40.0  100.0  4528000.0
1       Adam  40.0    NaN  4528000.0
2     Rooter  33.0   72.0  7028000.0
3   Peterson  42.0   85.0  2528000.0
4      lynda  42.0   85.0        NaN
5        NaN  42.0   85.0        NaN
6      Jenny   NaN  100.0    25632.0
7       Kenn   NaN  110.0    25632.0
8        Aly   NaN   90.0    25582.0
9       John  41.0   85.0  1528000.0
10       Ali  26.0   69.0        NaN
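
The example above only uses skiprows. As a sketch of skipfooter, and of passing a list to skiprows, the snippet below uses hypothetical inline data with a junk first line and a summary last line. Note that skipfooter is only supported by the Python parsing engine, so engine="python" is set explicitly.

```python
import io

import pandas as pd

# Hypothetical file content: a title line, the header, data, and a footer line.
csv_text = (
    "Report generated by HR\n"
    "Name,Age\n"
    "James,36\n"
    "Smith,34\n"
    "Gayle,40\n"
    "Total rows: 3\n"
)

# Skip 1 line at the top and 1 line at the bottom.
df_trimmed = pd.read_csv(
    io.StringIO(csv_text), skiprows=1, skipfooter=1, engine="python"
)
print(df_trimmed)

# A list skips specific 0-based line numbers, so the header (line 0) is kept.
body = "Name,Age\nJames,36\nSmith,34\nGayle,40\n"
df_skip_list = pd.read_csv(io.StringIO(body), skiprows=[1, 2])
print(df_skip_list)
```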


Example 5: Load only Selected Columns

The usecols argument selects which columns to load. There are two common ways to use it:

Method 1: Use usecols with Column Names
df = pd.read_csv('my_data.csv', usecols=['column name one', 'column name two'])

Method 2: Use usecols with Column Positions
df = pd.read_csv('my_data.csv', usecols=[0, 2])

In [17]:
# Method 1: select columns by name
df = pd.read_csv('employee.csv', usecols=['Name', 'Salary'])
print(df)

# Method 2: select columns by position (0 is Name, 3 is Salary)
#df = pd.read_csv('employee.csv', usecols=[0, 3])
#print(df)


        Name     Salary
0      James  5428000.0
1    Villers  3428000.0
2      VKole  8428000.0
3      Smith  4428000.0
4      Gayle  4528000.0
5       Adam  4528000.0
6     Rooter  7028000.0
7   Peterson  2528000.0
8      lynda        NaN
9        NaN        NaN
10     Jenny    25632.0
11      Kenn    25632.0
12       Aly    25582.0
13      John  1528000.0
14       Ali        NaN
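
The two methods can be compared side by side on a small hypothetical sample. The sketch also shows a third option that read_csv() supports: passing a callable to usecols, which is called with each column name and keeps the columns for which it returns True.

```python
import io

import pandas as pd

# Hypothetical sample with the same column layout as the lab file.
csv_text = "Name,Age,Weight,Salary\nJames,36,75,5428000\nSmith,34,80,4428000\n"

# Method 1: by column name.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Salary"])

# Method 2: by 0-based column position.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])

# Callable: keep every column except Weight.
by_rule = pd.read_csv(io.StringIO(csv_text), usecols=lambda col: col != "Weight")

print(by_name)
print(by_rule)
```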


Example 6: Set DataTypes to Columns

By default, read_csv() infers the data type that best fits each column. We can check the data types of the columns with df.dtypes:

In [18]:
df = pd.read_csv('employee.csv')
print(df.dtypes)


Name       object
Age       float64
Weight    float64
Salary    float64
dtype: object


Let's change the Name column to the string type.

In [22]:
# Set column data types
df = pd.read_csv('employee.csv', dtype={'Name':'string' })
#df = pd.read_csv('employee.csv', dtype={'Salary':'string' })
print(df.dtypes)


Name       string
Age       float64
Weight    float64
Salary    float64
dtype: object
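
One common reason to set dtype explicitly is that inference can change the data. In the sketch below, a hypothetical EmpId column (not part of the lab file) loses its leading zeros when inferred as an integer, and an Age column with a missing value becomes float. Reading EmpId as str and Age as the nullable "Int64" type avoids both.

```python
import io

import pandas as pd

# Hypothetical sample: IDs with leading zeros and one missing Age.
csv_text = "EmpId,Name,Age\n007,James,36\n042,Smith,\n"

# Default inference: EmpId becomes an integer, Age becomes float because of the NaN.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.dtypes)

# Explicit types: keep EmpId as text, store Age as a nullable integer.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"EmpId": str, "Age": "Int64"})
print(typed_df)
```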


Example 7: Handling Missing Data or NaN Values

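As we learned in the previous lesson, the fillna() method can be used to deal with NaN values. Passing a dictionary as value sets a separate replacement for each column. Note that filling a numeric column such as Age or Weight with a string converts that column to the object dtype.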


In [23]:
df = pd.read_csv('employee.csv', dtype={'Name':'string' })
df2 = df.fillna(value={'Name':'Verification Pending','Age':"Unknown", 'Weight': "pending", 'Salary': 0.0})
print(df2)

                    Name      Age   Weight     Salary
0                  James     36.0     75.0  5428000.0
1                Villers     38.0     74.0  3428000.0
2                  VKole     31.0     70.0  8428000.0
3                  Smith     34.0     80.0  4428000.0
4                  Gayle     40.0    100.0  4528000.0
5                   Adam     40.0  pending  4528000.0
6                 Rooter     33.0     72.0  7028000.0
7               Peterson     42.0     85.0  2528000.0
8                  lynda     42.0     85.0        0.0
9   Verification Pending     42.0     85.0        0.0
10                 Jenny  Unknown    100.0    25632.0
11                  Kenn  Unknown    110.0    25632.0
12                   Aly  Unknown     90.0    25582.0
13                  John     41.0     85.0  1528000.0
14                   Ali     26.0     69.0        0.0
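
If the numeric columns need to stay numeric for later calculations, one alternative is to fill them with numbers instead of strings. The sketch below is only an illustration on hypothetical data; using the column median for Age is an assumption, not a recommendation from the lab.

```python
import io

import pandas as pd

# Hypothetical sample with a missing Name, Age, and Salary.
csv_text = "Name,Age,Salary\nJames,36,5428000\nJenny,,25632\n,42,\n"
df_missing = pd.read_csv(io.StringIO(csv_text))

# Fill each column with a value of its own type so dtypes do not change.
filled = df_missing.fillna(
    value={
        "Name": "Verification Pending",
        "Age": df_missing["Age"].median(),  # median of the known ages
        "Salary": 0.0,
    }
)
print(filled)
print(filled.dtypes)
```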


Example 8: Pandas Read Multiple CSV Files into DataFrame


Sometimes you may need to read or import multiple CSV files from a folder or from a list of files and convert them into a pandas DataFrame. You can do this by reading each CSV file into a DataFrame and concatenating the DataFrames with pd.concat() to create a single DataFrame with data from all files.


When you want to read multiple CSV files that exist in different folders, first create a list of strings with absolute paths and use it as shown below to load all CSV files and create one big Pandas DataFrame.


In [3]:
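# Load the JSON file (json.load parses a file object; json.dump writes one)
with open("cars.json", "r") as read_file:
    data = json.load(read_file)

# Read CSV files from a list of paths and concatenate them into one DataFrame
df = pd.concat(map(pd.read_csv, ['car_data1.csv', 'car_data2.csv', 'car_data3.csv']))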
df


FileNotFoundError: [Errno 2] No such file or directory: 'car_data1.csv'
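
The FileNotFoundError above means the CSV files were not found at the given paths. As a self-contained sketch of the same technique, the snippet below first writes two small hypothetical CSV files into a temporary folder, then finds them with glob and concatenates them. The file contents and the Make and Year columns are assumptions for illustration only.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Create two hypothetical CSV files so the example does not depend on real data.
    samples = ["Toyota,2019\nHonda,2020\n", "Ford,2018\n"]
    for i, rows in enumerate(samples, start=1):
        with open(os.path.join(folder, f"car_data{i}.csv"), "w") as f:
            f.write("Make,Year\n" + rows)

    # Collect matching paths; sorted() makes the order predictable.
    paths = sorted(glob.glob(os.path.join(folder, "car_data*.csv")))

    # ignore_index=True renumbers the rows 0..n-1 across all files.
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(combined)
```

Without ignore_index=True, each file's rows keep their own index starting at 0, so the combined DataFrame would contain duplicate index labels.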