### What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

### Why Use Pandas?

Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

### What Can Pandas Do?

Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
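As a rough illustration of the questions above, here is a minimal sketch. The column names and values are made up for the example, and `None` stands in for an empty value.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing (empty) value
df = pd.DataFrame({
    "calories": [420, 380, 390, None],
    "duration": [50, 40, 45, 60],
})

# Cleaning: drop every row that contains an empty (NaN) value
clean = df.dropna()

# Summary statistics on the cleaned data
print(clean["calories"].mean())  # average value
print(clean["calories"].max())   # maximum value
print(clean["calories"].min())   # minimum value

# Pairwise correlation between the numeric columns
print(clean.corr())
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data is still available for comparison.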



### Getting started

In [7]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas

### Pandas as pd

Pandas is usually imported under the pd alias.

In [2]:
import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.

In [5]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

myvar

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Checking the Library Version

The version string is stored in the __version__ attribute.

In [6]:
print(pd.__version__)

1.4.4


## Pandas Series

### What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [3]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

#### Return the first value of the Series:

In [9]:
print(myvar[0])

1

#### Create your own labels:

In [4]:
a = [1, 7, 2]

myvar1 = pd.Series(a, index = ["p", "q", "r"])

myvar1

p    1
q    7
r    2
dtype: int64

When you have created labels, you can access an item by referring to the label.

In [5]:
print(myvar1["q"])

7


### Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

In [8]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)
print(myvar["day2"])

day1    420
day2    380
day3    390
dtype: int64
380


Note: The keys of the dictionary become the labels.

Create a Series using only data from "day1" and "day3":

In [15]:
myvar1 = pd.Series(calories, index = ["day1", "day3"])

print(myvar1)

day1    420
day3    390
dtype: int64


## Pandas DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Create a DataFrame from two Series:
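A minimal sketch of that step, assuming two small Series whose names and values are invented here for illustration:

```python
import pandas as pd

# Two hypothetical Series that share the default index 0, 1, 2
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each Series becomes one column; the dictionary keys become the column names
df_from_series = pd.DataFrame({"calories": calories, "duration": duration})

print(df_from_series)
```

The Series are aligned on their index labels, so both should use the same labels if every row is meant to be complete.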

### What is a DataFrame?

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [9]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

df

   calories  duration
0       420        50
1       380        40
2       390        45


### Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [10]:
print(df.loc[0])
#print(type(df.loc[0]))

calories    420
duration     50
Name: 0, dtype: int64


Note: This example returns a Pandas Series.

In [12]:
#use a list of indexes:
print(df.loc[[1,2]])
#print(type(df.loc[[1,2]]))

   calories  duration
1       380        40
2       390        45


Note: When you pass a list of labels inside the brackets, as in df.loc[[1, 2]], the result is a Pandas DataFrame.

In [22]:
df.loc[[1,2]]

   calories  duration
1       380        40
2       390        45


### Named Indexes

With the index argument, you can name your own indexes.

In [13]:
df1 = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df1

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

In [16]:
print(df1.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64


#### Difference between loc and iloc

Both are used for selecting and slicing data from a Pandas DataFrame, and loc is also commonly used to filter the data according to some condition. Both are indexers used with square brackets, not function calls.

They help in the convenient selection of data from the DataFrame. 
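The sketch below shows one visible difference between the two, the handling of the end of a slice. The DataFrame and its "day" labels are hypothetical and chosen so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ
demo = pd.DataFrame(
    {"calories": [420, 380, 390, 300]},
    index=["day1", "day2", "day3", "day4"],
)

# loc slices by label and includes the end label
by_label = demo.loc["day1":"day3"]

# iloc slices by integer position and excludes the end position
by_position = demo.iloc[0:2]

print(len(by_label))     # 3 rows: day1, day2, day3
print(len(by_position))  # 2 rows: day1, day2
```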

#### loc

loc is a label-based selection method, which means that we pass the name (label) of the rows or columns we want to select. Unlike iloc, a loc slice includes the last element of the range passed to it. loc also accepts boolean masks, including a boolean Series such as the result of a comparison. Some of the operations that can be performed with loc:

In [38]:
# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 70000,
                                    25000, 10000, 46000,
                                    31000, 15000, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai',  'Ghaziabad'],
                     'Mileage':  [28, 27, 25, 26, 28,
                                  29, 24, 21, 24]})
 
# displaying the DataFrame
display(data)

      Brand  Year  Kms Driven       City  Mileage
0    Maruti  2012       50000    Gurgaon       28
1   Hyundai  2014       30000      Delhi       27
2      Tata  2011       70000     Mumbai       25
3  Mahindra  2015       25000      Delhi       26
4    Maruti  2012       10000     Mumbai       28
5   Hyundai  2016       46000      Delhi       29
6   Renault  2014       31000     Mumbai       24
7      Tata  2018       15000    Chennai       21
8    Maruti  2019       12000  Ghaziabad       24


In [22]:
# selecting cars with brand 'Maruti' and Mileage > 10
display(data.loc[(data.Brand == 'Maruti') & (data.Mileage > 10)])

    Brand  Year  Kms Driven       City  Mileage
0  Maruti  2012       50000    Gurgaon       28
4  Maruti  2012       10000     Mumbai       28
8  Maruti  2019       12000  Ghaziabad       24


Selecting a range of rows and specific columns from the DataFrame (the end label 5 is included):

In [23]:
display(data.loc[1:5,['Brand','Year']])

      Brand  Year
1   Hyundai  2014
2      Tata  2011
3  Mahindra  2015
4    Maruti  2012
5   Hyundai  2016


In [24]:
data

      Brand  Year  Kms Driven       City  Mileage
0    Maruti  2012       50000    Gurgaon       28
1   Hyundai  2014       30000      Delhi       27
2      Tata  2011       70000     Mumbai       25
3  Mahindra  2015       25000      Delhi       26
4    Maruti  2012       10000     Mumbai       28
5   Hyundai  2016       46000      Delhi       29
6   Renault  2014       31000     Mumbai       24
7      Tata  2018       15000    Chennai       21
8    Maruti  2019       12000  Ghaziabad       24


Updating the value of a column for the rows that match a condition:

In [25]:
data.loc[(data.Year < 2015), ['Mileage']] = 22
display(data)

      Brand  Year  Kms Driven       City  Mileage
0    Maruti  2012       50000    Gurgaon       22
1   Hyundai  2014       30000      Delhi       22
2      Tata  2011       70000     Mumbai       22
3  Mahindra  2015       25000      Delhi       26
4    Maruti  2012       10000     Mumbai       22
5   Hyundai  2016       46000      Delhi       29
6   Renault  2014       31000     Mumbai       22
7      Tata  2018       15000    Chennai       21
8    Maruti  2019       12000  Ghaziabad       24


#### iloc

iloc is an integer position-based selection method, which means that we pass integer positions to select specific rows or columns. Unlike loc, an iloc slice does not include the last element of the range passed to it. iloc accepts a boolean list or array of the same length as the axis, but not a boolean Series. Operations performed using iloc include:

Selecting rows using integer indices

In [26]:
display(data.iloc[[0, 2, 4, 7]])

    Brand  Year  Kms Driven     City  Mileage
0  Maruti  2012       50000  Gurgaon       22
2    Tata  2011       70000   Mumbai       22
4  Maruti  2012       10000   Mumbai       22
7    Tata  2018       15000  Chennai       21


Selecting a range of columns and rows simultaneously

In [29]:
display(data.iloc[1: 5, 2: 5])

   Kms Driven    City  Mileage
1       30000   Delhi       22
2       70000  Mumbai       22
3       25000   Delhi       26
4       10000  Mumbai       22


In [31]:
display(data.loc[(data.Brand == 'Maruti')])

    Brand  Year  Kms Driven       City  Mileage
0  Maruti  2012       50000    Gurgaon       22
4  Maruti  2012       10000     Mumbai       22
8  Maruti  2019       12000  Ghaziabad       24


#### Series.loc and Series.iloc

In [32]:
# Creating the Series
sr = pd.Series(['New York', 'Chicago', 'Toronto', 'Lisbon'])
  
# Creating the row axis labels
sr.index = ['City 1', 'City 2', 'City 3', 'City 4'] 
sr

City 1    New York
City 2     Chicago
City 3     Toronto
City 4      Lisbon
dtype: object

Selecting values based on index name:

In [33]:
sr.loc[['City 4', 'City 3', 'City 1']]

City 4      Lisbon
City 3     Toronto
City 1    New York
dtype: object

Selecting a range of values in a Series:

In [34]:
sr.iloc[0:3]

City 1    New York
City 2     Chicago
City 3     Toronto
dtype: object

In [35]:
sr.iloc[:3]

City 1    New York
City 2     Chicago
City 3     Toronto
dtype: object

Using a boolean mask of the same length as the index:

In [36]:
sr.iloc[[True, False, True, False]]

City 1    New York
City 3     Toronto
dtype: object

Selecting by the integer position of the row:

In [37]:
sr.iloc[[0, 2]]

City 1    New York
City 3     Toronto
dtype: object
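The boolean mask behavior shown above can be compared for loc and iloc in a small sketch. It rebuilds the same city Series; the mask condition (names ending in "o") is chosen only for illustration, and the exact exception type raised by iloc is an assumption that may vary between Pandas versions.

```python
import pandas as pd

sr = pd.Series(["New York", "Chicago", "Toronto", "Lisbon"],
               index=["City 1", "City 2", "City 3", "City 4"])

# A boolean Series that carries the same index labels as sr
mask = sr.str.endswith("o")

# loc accepts the boolean Series directly
from_loc = sr.loc[mask]

# iloc accepts a plain boolean list or NumPy array of the same length
from_iloc = sr.iloc[mask.to_numpy()]

# Passing the boolean Series itself to iloc is rejected
rejected = False
try:
    sr.iloc[mask]
except (NotImplementedError, ValueError):
    rejected = True

print(from_loc)
print(from_iloc)
print(rejected)
```

Converting the mask with `to_numpy()` drops the index labels, which is why iloc can then treat it as purely positional.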

### Pandas Read and Write Files

#### read_csv

pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=_NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)


In [41]:
df = pd.read_csv('data.csv')

print(type(df)) 
df

<class 'pandas.core.frame.DataFrame'>


   calories  duration
0       420        50
1       380        40
2       390        45
3       300        60
4       430        35
5       210        45
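Most of the read_csv parameters can be left at their defaults. As a hedged sketch of two that are often useful, the example below reads from an in-memory string instead of a file; the CSV contents and the "notes" column are invented for the example.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "calories,duration,notes\n420,50,a\n380,40,b\n390,45,c\n"

# usecols keeps only the listed columns; nrows limits how many rows are read
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["calories", "duration"],
    nrows=2,
)

print(subset)
```

The same keyword arguments apply when the first argument is a file path such as 'data.csv'.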


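A Windows path contains backslashes, and inside a normal Python string a backslash starts an escape sequence (for example, \t is a tab). The path has to be written so that the backslashes are taken literally.
There are 3 ways of doing it:

1. Put r before the start of the string. This makes it a raw string.
2. Replace \ with /
3. Replace \ with \\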
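The three options can be sketched as follows. The path C:\temp\new_data.csv is hypothetical and was picked because it contains \t and \n, which makes the problem visible.

```python
# Hypothetical Windows path, used only for illustration
raw_path = r"C:\temp\new_data.csv"        # 1. raw string
forward_path = "C:/temp/new_data.csv"     # 2. forward slashes
escaped_path = "C:\\temp\\new_data.csv"   # 3. doubled backslashes

# The raw and the escaped forms are the same string
print(raw_path == escaped_path)

# Without any of these fixes, \t and \n are read as a tab and a newline
broken_path = "C:\temp\new_data.csv"
print("\t" in broken_path, "\n" in broken_path)
```

Any of the three working forms can be passed to pd.read_csv or to_csv on Windows.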

In [42]:
dfDifferentPath = pd.read_csv(r'C:\Users\shesmeg\OneDrive - Baker Hughes\Personal\Besant_Technologies\DSTraining\Research_and_development_survey_2021_CSV_notes.csv')

In [43]:
dfDifferentPath

   Number                                           Footnote
0       1  Sector and published industry breakdowns accor...
1       2  In 2019 the R&D Survey was conducted only for ...
2       3  Results for this category should be treated wi...
3       4  Includes a wide range of ANZSIC industry codes...
4       5  Based on Rolling Mean employment. See Datainfo...
5       6  Only collected from the business sector. Not c...
6       7  Only collected from the business sector and go...
7       8  Only collected from the business sector and hi...
8       9  Only collected from the government and higher ...
9      10   Only collected from the higher education sect...


#### to_csv

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', lineterminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors='strict', storage_options=None)


In [44]:
data.to_csv(r'C:\Users\shesmeg\OneDrive - Baker Hughes\Personal\Besant_Technologies\DSTraining\Module_4\car_data1.csv')
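By default, index=True, so to_csv writes the row index as an extra first column. A minimal sketch of the effect, using in-memory buffers and a small made-up DataFrame rather than the car data above:

```python
import io
import pandas as pd

df = pd.DataFrame({"calories": [420, 380], "duration": [50, 40]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'calories', 'duration']
print(cols_without_index)  # ['calories', 'duration']
```

When the file is read back, the unnamed index column shows up as "Unnamed: 0"; passing index=False when writing, or index_col=0 when reading, avoids it.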