#  Introduction to Numpy and Pandas
> This blog gives an introduction to the basic commands of NumPy and pandas

- toc: true
- branch: master
- badges: true
- comments: true
- author: Prachi Natu
- categories: ['fastpages', 'jupyter']
- image: images/numpy1.png

# Introduction to Python Libraries: Numpy and Pandas

**This blog explores the NumPy library for basic array/matrix operations and the pandas library for reading and manipulating different types of files.**<br>
<br>
**Prerequisites:** Basic Python data structures such as list, tuple, and dictionary, and Python control structures.<br>
<br>

## Numpy

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. Using NumPy, you can perform mathematical and logical operations on arrays. NumPy provides the foundational data structures and operations for SciPy.
Install the NumPy package using **pip install numpy** or **conda install numpy**.

### Array
The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index.
Every item in an ndarray occupies a block of memory of the same size. Each ndarray is associated with a data-type object (called dtype) that describes the type of its elements.
Any single item extracted from an ndarray object (by indexing) is represented by a Python object of one of the array scalar types. The three are related as follows: the dtype describes how each element is stored, and the array scalar is what indexing a single element returns.
The basic ndarray is created using the array function in NumPy, as the next section shows.
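
As a minimal sketch of how ndarray, dtype, and array scalars relate, the snippet below inspects a small array. The values and the explicit `dtype=np.int64` are illustrative choices made here so the result does not depend on the platform; they are not part of the original notebook.

```python
import numpy as np

# dtype is set explicitly so the result is the same on every platform
arr = np.array([1, 2, 3, 4], dtype=np.int64)

print(arr.dtype)     # one dtype object describes every element
print(arr.itemsize)  # number of bytes used by each element
item = arr[0]        # indexing a single element returns an array scalar
print(type(item))
```

Leaving out `dtype` lets NumPy pick a default integer type, which is why the same code can report `int32` on one machine and `int64` on another.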

### Defining Arrays and Basic Array Operations

In [1]:
# Import numpy library before using it
import numpy as np
x=np.array([1,2,3,4])# Define 1D Array
print(x) # Print the array

[1 2 3 4]


In [2]:
x1=np.array([[1,2,3],[4,5,6]]) #Define 2D Array
print(x1)

[[1 2 3]
 [4 5 6]]


In [3]:
np.shape(x) # Return the shape of an array. i.e. Number of rows and columns in array

(4,)

In [4]:
np.shape(x1)

(2, 3)

In [5]:
x1.shape # This also gives shape of array

(2, 3)

In [6]:
x1.ndim # ndim gives no. of dimensions of array. x1 is 2 dimensional array

2

In [7]:
x.ndim # Check dimension of x

1

In [8]:
print(len(x1))
len(x)

2


4

In [9]:
type(x1) # It gives type of x1

numpy.ndarray

In [10]:
x1.dtype # Gives data type of elements in x1

dtype('int32')

In [11]:
# creating array from nested python lists
list1=[1,2,3,4]
list2=[2,3,4,5]
list3=[3,4,5,6]
array1=np.array([list1,list2,list3])
print(array1)
print(array1.shape)

[[1 2 3 4]
 [2 3 4 5]
 [3 4 5 6]]
(3, 4)


In [12]:
# Generate an array of random floats in the half-open interval [0.0, 1.0)
# If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn.
# The default size is None, in which case a single value is returned.
x3=np.random.random((4,3))
print(x3)

[[0.83139996 0.97840645 0.28350016]
 [0.63990735 0.82499188 0.98103649]
 [0.68873466 0.20833985 0.15354768]
 [0.89703541 0.23580797 0.86506907]]


In [13]:
# Return a new array of given shape and type, filled with zeros.
z=np.zeros((3,3))
print(z)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [14]:
# Return a new array of given shape and type, filled with ones.
z1=np.ones((3,3))
print(z1)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [15]:
len(z1)

3

In [16]:
md=np.random.rand(2,3) # Generate a 2x3 matrix of random values in the interval [0, 1)
print(md)

[[0.97070558 0.16497442 0.04916694]
 [0.34241996 0.67914468 0.58664927]]


In [17]:
a=np.arange(1,5,1) # np.arange(start, stop, step): generate a sequence of numbers, excluding the stop value
print(a)

[1 2 3 4]


In [18]:
a=np.arange(10,50,4)
print(a)
print(a.dtype)
type(a)

[10 14 18 22 26 30 34 38 42 46]
int32


numpy.ndarray

In [19]:
np.arange(3) # if no interval is mentioned, Default interval is one

array([0, 1, 2])

In [20]:
# Return evenly spaced numbers over a specified interval.
#Returns num evenly spaced samples, calculated over the interval [start, stop].
#The endpoint of the interval can optionally be excluded.
b=np.linspace(10,50,4)# generate sequence of numbers, start, stop,how many numbers
print(b)

[10.         23.33333333 36.66666667 50.        ]
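
The comment above notes that the endpoint can optionally be excluded. A small sketch of that option, reusing the same start, stop, and count as the cell above and comparing against `np.arange` (the variable names are made up for this illustration):

```python
import numpy as np

with_end = np.linspace(10, 50, 4)                     # stop value 50 is included
without_end = np.linspace(10, 50, 4, endpoint=False)  # stop value 50 is excluded
stepped = np.arange(10, 50, 10)                       # fixed step, stop value excluded

print(with_end)
print(without_end)
print(stepped)
```

In short, `linspace` takes the number of points, while `arange` takes the step between them.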


In [21]:
# numpy.full(shape, fill_value, dtype=None, order='C')
# Return a new array of given shape and type, filled with fill_value
m=np.full((5,5),11)
print(m)

[[11 11 11 11 11]
 [11 11 11 11 11]
 [11 11 11 11 11]
 [11 11 11 11 11]
 [11 11 11 11 11]]


In [22]:
a=np.arange(10,50,4)
print(a)
print(a.dtype)
type(a)

[10 14 18 22 26 30 34 38 42 46]
int32


numpy.ndarray

### Array slicing:
The contents of an ndarray object can be accessed and modified by indexing or slicing, just like Python's built-in container objects.
As mentioned earlier, items in an ndarray object follow a zero-based index. Three types of indexing methods are available: field access, basic slicing, and advanced indexing.

Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of array.
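
To make the slice object concrete, here is a minimal sketch that builds one with the built-in `slice` function and passes it to an array. The array matches the one used in the cells of this section; the name `s` and the chosen bounds are illustrative.

```python
import numpy as np

a = np.arange(10, 50, 4)   # [10 14 18 22 26 30 34 38 42 46]
s = slice(2, 7, 2)         # start=2, stop=7 (excluded), step=2

print(a[s])                # picks the items at indexes 2, 4 and 6
print(a[2:7:2])            # the colon notation builds the same slice
```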

In [23]:
# Extracting required elements of array
#If only one parameter is given, the single item at that index is returned.
#If a : follows it, all items from that index onwards are extracted.
#If two parameters (with : between them) are used, items between the two indexes (not including the stop index)
#with default step one are sliced.
print(a)
print(a[2]) # element with index 2

[10 14 18 22 26 30 34 38 42 46]
18


In [24]:
print(a[2:5]) # elements from index 2 to 5-1 are printed

[18 22 26]


In [25]:
print(a[2:]) # all elements from index 2 onwards are printed

[18 22 26 30 34 38 42 46]


In [26]:
print(a[:3]) # element from index 0 to 3-1 are printed

[10 14 18]


In [27]:
print(a[2::2])# from index 2, with interval 2

[18 26 34 42]


In [28]:
print(a[2::3]) # from index 2, with interval 3

[18 30 42]


In [29]:
print(a[::-1]) # elements from last to first, i.e. reverse order as step is -1

[46 42 38 34 30 26 22 18 14 10]


In [30]:
print(a[::-2]) # Elements in reverse order with step -2

[46 38 30 22 14]


In [31]:
print(a[2::-1]) # Elements in reverse order from index 2 to 0 with step -1

[18 14 10]


In [32]:
print(a[-4::2]) # print elements from 4th last element onwards with step 2

[34 42]


In [33]:
a = np.array([[1,2,3],[3,4,5],[4,5,6]]) 
print(a)

# slice rows starting from index 1
print('Now we will slice the array from the index a[1:]')
print(a[1:]) # Print from the second row (index 1) onwards

[[1 2 3]
 [3 4 5]
 [4 5 6]]
Now we will slice the array from the index a[1:]
[[3 4 5]
 [4 5 6]]


Slicing can also include an ellipsis (...) to make a selection tuple of the same length as the number of dimensions of an array.
If the ellipsis is used at the row position, all rows are selected and the result is an ndarray of the items at the given column index.
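
The ellipsis is most useful with more than two dimensions, where it stands for as many full slices as needed. A short sketch on a hypothetical 3-D array (the name `cube` and its shape are chosen only for this illustration):

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)  # hypothetical 3-D array

last_axis_first = cube[..., 0]   # same as cube[:, :, 0]
first_block = cube[0, ...]       # same as cube[0, :, :]

print(last_axis_first.shape)
print(first_block.shape)
```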

In [34]:
# this returns array of items in the second column 
print('The items in the second column are:')  
print(a[...,1])
print('\n')

# Now we will slice all items from the second row 
print('The items in the second row are:')
print(a[1,...]) 
print('\n')
# What is output of this? Execute and check
print(a[1:,...])

# Now we will slice all items from column 1 onwards 
print('The items column 1 onwards are:') 
print(a[...,1:])

The items in the second column are:
[2 4 5]


The items in the second row are:
[3 4 5]


[[3 4 5]
 [4 5 6]]
The items column 1 onwards are:
[[2 3]
 [4 5]
 [5 6]]


### Array manipulation
Several routines are available in the NumPy package for manipulating the elements of an ndarray object.
Reshaping, joining, and broadcasting are covered here.

Concatenation refers to joining. The concatenate function joins two or more arrays along a specified axis; the arrays must have the same shape except along that axis.

NumPy has built-in support for broadcasting. When an arithmetic operation combines arrays of different shapes, NumPy stretches the smaller array to match the larger one, provided their shapes are compatible.
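
A minimal sketch of broadcasting with two small arrays whose shapes differ but are compatible, followed by a pair of shapes that cannot be combined. The arrays and names here are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

col = np.array([[0.0], [10.0], [20.0]])  # shape (3, 1)
row = np.array([1.0, 2.0, 3.0])          # shape (3,), treated as (1, 3)

total = col + row                        # both operands behave as shape (3, 3)
print(total)

# Shapes that cannot be aligned raise an error instead of being stretched.
try:
    np.ones((4, 3)) + np.ones(4)
except ValueError as err:
    print("cannot broadcast:", err)
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.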

In [35]:
np.reshape(x1,(3,2)) # Reshapes array to specified dimension

array([[1, 2],
       [3, 4],
       [5, 6]])

In [36]:
x1.reshape(3,-1) # pass -1 for one dimension to let NumPy infer it from the array size; useful for large arrays

array([[1, 2],
       [3, 4],
       [5, 6]])

In [37]:
np.transpose(a)# interchange rows and columns of array

array([[1, 3, 4],
       [2, 4, 5],
       [3, 5, 6]])

In [38]:
import numpy as np 
a = np.array([[1,2],[3,4]]) 

print('First array:')
print(a) 
print('\n')
b = np.array([[5,6],[7,8]]) 

print('Second array:') 
print(b) 
print('\n')
# both the arrays are of same dimensions 

print('Joining the two arrays along axis 0:') 
print(np.concatenate((a,b)))
print('\n')  

print('Joining the two arrays along axis 1:') 
print (np.concatenate((a,b),axis = 1))

First array:
[[1 2]
 [3 4]]


Second array:
[[5 6]
 [7 8]]


Joining the two arrays along axis 0:
[[1 2]
 [3 4]
 [5 6]
 [7 8]]


Joining the two arrays along axis 1:
[[1 2 5 6]
 [3 4 7 8]]


In [39]:
# Broadcasting

a = np.array([[0.0,0.0,0.0],[10.0,10.0,10.0],[20.0,20.0,20.0],[30.0,30.0,30.0]]) 
b = np.array([1.0,2.0,3.0])
print(a)
print(b)
print(a+b)

[[ 0.  0.  0.]
 [10. 10. 10.]
 [20. 20. 20.]
 [30. 30. 30.]]
[1. 2. 3.]
[[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]


### Arithmetic, Statistical and Linear algebraic functions in Numpy

#### Arithmetic Operations

In [40]:
# Add two arrays
print(a+b)
print(np.add(a,b))

[[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]
[[ 1.  2.  3.]
 [11. 12. 13.]
 [21. 22. 23.]
 [31. 32. 33.]]


In [41]:
a=np.array([[1,2],[3,4]])
b=np.array([[2,2],[3,3]])
print(a)
print(b)

[[1 2]
 [3 4]]
[[2 2]
 [3 3]]


In [42]:
# Subtract two arrays
print(a-b)
print(np.subtract(a,b))

[[-1  0]
 [ 0  1]]
[[-1  0]
 [ 0  1]]


In [43]:
# Multiply 2 matrices elementwise
print(a*b)
print(np.multiply(a,b))

[[ 2  4]
 [ 9 12]]
[[ 2  4]
 [ 9 12]]


In [44]:
# Divide one array by another, elementwise
print(a/b)
print(np.divide(a,b))

[[0.5        1.        ]
 [1.         1.33333333]]
[[0.5        1.        ]
 [1.         1.33333333]]


In [45]:
# Floor division // rounds each result down to the nearest whole number
print(a//b)

[[0 1]
 [1 1]]


In [46]:
print(a)# a is 2x2 matrix
# Sum of the array elements (a scalar value if axis is none) or array with sum values along the specified axis.
np.sum(a)# sum of elements in matrix

[[1 2]
 [3 4]]


10

In [47]:
# Sum of the array elements (a scalar value if axis is none) or array with sum values along the specified axis.
np.sum(a,axis=0)

array([4, 6])

In [48]:
np.sum(a,axis=1)

array([3, 7])

### Statistical Functions in Numpy
NumPy has quite a few useful statistical functions for finding the minimum, maximum, mean, and median of the elements in an array.

In [49]:
# Using np.min, max, mean,median,argmax,argmin 
print(np.min(a))
print(np.max(a))
print(np.mean(a))
print(np.median(a))
print(np.argmin(a))
print(np.argmax(a))

1
4
2.5
2.5
0
3


### Algebraic Functions in Numpy

In [50]:
# Matrix product of two arrays. If both arguments are 2-D they are multiplied like conventional matrices.
np.matmul(a,b)

array([[ 8,  8],
       [18, 18]])

In [51]:
print(a*b)# element wise multiplication

[[ 2  4]
 [ 9 12]]


In [52]:
# This function returns the dot product of two arrays.
#For 2-D arrays, it is equivalent to matrix multiplication. For 1-D arrays, it is the inner product of the vectors.
#For N-dimensional arrays, it is a sum product over the last axis of a and the second-last axis of b

print(a.dot(b))# matrix multiplication

[[ 8  8]
 [18 18]]
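
For completeness, a short sketch of the `@` operator, which performs the same matrix product as `np.matmul`, together with the 1-D inner product mentioned in the comment above. The 2-D arrays repeat the values of `a` and `b` used in this section; `v` and `w` are hypothetical vectors added for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 2], [3, 3]])

product = a @ b          # matrix product, same result as np.matmul(a, b)
print(product)

v = np.array([1, 2])
w = np.array([3, 4])
inner = np.dot(v, w)     # inner product of two 1-D arrays: 1*3 + 2*4
print(inner)
```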


In [53]:
# try np.mod
# np.remainder
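
One possible way to try these two functions is sketched below; the input arrays are made-up values for illustration. `np.mod` and `np.remainder` compute the same element-wise remainder.

```python
import numpy as np

x = np.array([10, 20, 30])
y = np.array([3, 5, 7])

mod_result = np.mod(x, y)              # element-wise remainder of x / y
remainder_result = np.remainder(x, y)  # same computation under another name
negative_case = np.mod(-7, 3)          # the result takes the sign of the divisor

print(mod_result)
print(remainder_result)
print(negative_case)
```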

In [54]:
# Generating array with random values

ar1=np.random.rand(3,4) # Generate a 3x4 matrix of numbers between 0 and 1
print(ar1)

[[0.85643973 0.1031917  0.55385525 0.97920609]
 [0.68588271 0.88334514 0.81916291 0.22139077]
 [0.35481625 0.52037969 0.85144163 0.99203129]]


In [55]:
ar2=np.random.randn(3,4) #Return a sample (or samples) from the “standard normal” distribution.
print(ar2)

[[ 0.4806007  -2.15999049 -0.30226734 -0.02449045]
 [ 0.32552263  0.27830224 -0.9562531  -2.73828117]
 [-1.37379804  0.50617974  0.68903693  1.60446845]]


In [56]:
# numpy.random.randint(low, high=None, size=None, dtype=int)
#Return random integers from low (inclusive) to high (exclusive).
#Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high).
#If high is None (the default), then results are from [0, low).
ar3=np.random.randint(1,10,8).reshape(4,2)
print(ar3)

[[6 1]
 [8 6]
 [7 7]
 [1 2]]


In [57]:
# Use statistical functions on array ar3

# find min element of array
print(np.min(ar3))

# Find maximum element of array
print(np.max(ar3))

# Find mean of all array elements
print(np.mean(ar3))

# Find the median of each column (axis=0)
print(np.median(ar3,axis=0))

# index of smallest number in array
print(np.argmin(ar3))

# index of smallest number at given axis
print(np.argmin(ar3,axis=0))
print(np.argmin(ar3,axis=1))

# index of largest number in array
print(np.argmax(ar3))

1
8
4.75
[6.5 4. ]
1
[3 0]
[1 1 0 0]
2


In [58]:
#Return random floats in the half-open interval [0.0, 1.0).
#Results are from the “continuous uniform” distribution over the stated interval. 
ar4=np.random.random_sample(3)
print(ar4)

[0.01310365 0.57901306 0.80111693]


In [59]:
print(np.argmax(ar3))
print(np.argmax(ar3,axis=0))
print(np.argmax(ar3,axis=1))

2
[1 2]
[0 0 0 1]


## PANDAS
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

### Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
**s = pd.Series(data, index=index)**
Here, data can be many different things: a Python dict, an ndarray, or a scalar value (like 5).
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is.
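
The cases can be sketched as follows. All values and labels are invented for illustration; the behaviour shown (a scalar repeated for every label, and a missing dict key becoming NaN) is standard for `pd.Series`.

```python
import numpy as np
import pandas as pd

# From an ndarray: the index must have the same length as the data
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])

# From a scalar: the value is repeated for every label in the index
from_scalar = pd.Series(5, index=['a', 'b', 'c'])

# From a dict with an explicit index: values are looked up by label,
# and a label missing from the dict ('d') becomes NaN
from_dict = pd.Series({'b': 1, 'a': 0, 'c': 2}, index=['a', 'b', 'd'])

print(from_array)
print(from_scalar)
print(from_dict)
```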

In [60]:
import os
import numpy as np
import pandas as pd

In [61]:
# Generate a series as random array whose elements are drawn from normal distribution
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.590900
b    1.195079
c    0.236404
d    0.576171
e    0.580070
dtype: float64

In [62]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [63]:
#Series can be instantiated from dicts:
#When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order 
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [64]:
#When working with raw NumPy arrays, looping through value-by-value is usually not necessary. 
#The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.
s+s

a   -1.181801
b    2.390159
c    0.472807
d    1.152343
e    1.160141
dtype: float64
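
A short sketch of both points in the comment above, using made-up values: a NumPy function applied directly to a Series, and arithmetic between two Series, which pandas aligns by label rather than by position.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])
roots = np.sqrt(s1)        # NumPy function applied element-wise; the index is kept

s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
aligned = s1 + s2          # labels are matched; 'a' has no partner and becomes NaN

print(roots)
print(aligned)
```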

### Dataframe
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

A dict of 1-D ndarrays, lists, dicts, or Series; a 2-D numpy.ndarray; a structured or record ndarray; a Series; or another DataFrame.

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
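
The effect of passing `index` and `columns` can be sketched with a small dict of Series. The labels chosen for `index` and the extra column name `'three'` are illustrative assumptions.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Only the rows named in index are kept, in the order given; row 'c' is discarded
subset = pd.DataFrame(d, index=['d', 'b', 'a'])

# A requested column that is not in the data ('three') is filled with NaN
with_cols = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

print(subset)
print(with_cols)
```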

In [65]:
# Creating Dataframe from series
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [66]:
# Read .csv file
data=pd.read_csv('C:/Users/Swift 3/AppData/Local/Programs/Python/Python38/Iris_data_sample.csv')

In [67]:
# To view a small sample of a Series or DataFrame object, use the head() and tail() methods. 
#The default number of elements to display is five, but you may pass a custom number.
data.head()

   Unnamed: 0 SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm      Species
0           1           5.1           3.5           1.4           0.2  Iris-setosa
1           2           4.9           NaN           1.4           0.2          NaN
2           3           4.7           3.2           1.3           0.2  Iris-setosa
3           4            ??           3.1           1.5           0.2  Iris-setosa
4           5             5           3.6           ###           0.2  Iris-setosa


In [68]:
# By default, the index column is labeled from 0 onwards.
# index_col=0 uses the first column of the file as the index
# instead of the default 0, 1, 2, 3, ...
data=pd.read_csv('C:/Users/Swift 3/AppData/Local/Programs/Python/Python38/Iris_data_sample.csv',index_col=0)

In [69]:
data.head()

  SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm      Species
1           5.1           3.5           1.4           0.2  Iris-setosa
2           4.9           NaN           1.4           0.2          NaN
3           4.7           3.2           1.3           0.2  Iris-setosa
4            ??           3.1           1.5           0.2  Iris-setosa
5             5           3.6           ###           0.2  Iris-setosa


In [70]:
data.index # to get row labels

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            141, 142, 143, 144, 145, 146, 147, 148, 149, 150],
           dtype='int64', length=150)

In [71]:
data.columns # to get column labels

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [72]:
data.shape # number of rows and columns in the dataframe

(150, 5)

In [73]:
data.size # no of elements in dataframe

750

In [74]:
data.ndim # no of dimensions of dataframe

2

In [75]:
data.memory_usage()#Return the memory usage of each column in bytes.

Index            1200
SepalLengthCm    1200
SepalWidthCm     1200
PetalLengthCm    1200
PetalWidthCm     1200
Species          1200
dtype: int64

### Indexing and Slicing of DataFrame

In [76]:
data.head(5)

  SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm      Species
1           5.1           3.5           1.4           0.2  Iris-setosa
2           4.9           NaN           1.4           0.2          NaN
3           4.7           3.2           1.3           0.2  Iris-setosa
4            ??           3.1           1.5           0.2  Iris-setosa
5             5           3.6           ###           0.2  Iris-setosa


In [77]:
data.tail(4)

    SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm         Species
147           6.3           2.5           5.0           1.9  Iris-virginica
148           6.5           3.0           5.2           2.0  Iris-virginica
149           6.2           3.4           5.4           2.3  Iris-virginica
150           5.9           3.0           5.1           1.8  Iris-virginica


In [78]:
data.loc[:,'SepalLengthCm']# to access rows or columns by label

1      5.1
2      4.9
3      4.7
4       ??
5        5
      ... 
146    6.7
147    6.3
148    6.5
149    6.2
150    5.9
Name: SepalLengthCm, Length: 150, dtype: object

In [79]:
data.loc[:,'Species']

1         Iris-setosa
2                 NaN
3         Iris-setosa
4         Iris-setosa
5         Iris-setosa
            ...      
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
150    Iris-virginica
Name: Species, Length: 150, dtype: object

In [80]:
data.loc[:,['SepalLengthCm','Species']]# access multiple columns by label

    SepalLengthCm         Species
1             5.1     Iris-setosa
2             4.9             NaN
3             4.7     Iris-setosa
4              ??     Iris-setosa
5               5     Iris-setosa
..            ...             ...
146           6.7  Iris-virginica
147           6.3  Iris-virginica
148           6.5  Iris-virginica
149           6.2  Iris-virginica


In [81]:
data.loc[4,:] # access the row labeled 4 (the 4th row here, because the labels start at 1)

SepalLengthCm             ??
SepalWidthCm             3.1
PetalLengthCm            1.5
PetalWidthCm             0.2
Species          Iris-setosa
Name: 4, dtype: object

In [82]:
data.loc[[4,7],:] # access the rows labeled 4 and 7

  SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm      Species
4            ??           3.1           1.5           0.2  Iris-setosa
7           4.6           3.4           1.4           0.3  Iris-setosa


In [83]:
data.iloc[0:4,0:2]# access first 4 rows and first 2 columns using index of rows and columns

  SepalLengthCm  SepalWidthCm
1           5.1           3.5
2           4.9           NaN
3           4.7           3.2
4            ??           3.1


In [84]:
data.iloc[34:37,:]# dataframename.iloc[start:end,start:end]    end value is excluded

   SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm      Species
35           4.9           3.1           1.5           0.1  Iris-setosa
36           5.0           3.2           1.2           0.2  Iris-setosa
37           5.5           3.5           1.3           0.2  Iris-setosa
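
Because the row labels of this dataset start at 1, label-based `loc` and position-based `iloc` are easy to mix up. A minimal sketch on a tiny made-up DataFrame (not the Iris file) shows the difference:

```python
import pandas as pd

# Hypothetical frame whose row labels start at 1, like the Iris data above
df = pd.DataFrame({'length': [5.1, 4.9, 4.7], 'width': [3.5, 3.0, 3.2]},
                  index=[1, 2, 3])

by_label = df.loc[1:2, 'length']   # labels 1 and 2; the stop label is included
by_position = df.iloc[0:2, 0]      # positions 0 and 1; the stop position is excluded

print(by_label)
print(by_position)
```

Both selections return the same two rows here, but `loc[1:2]` counts labels while `iloc[0:2]` counts positions.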


In [85]:
# create a copy of dataframe
iris1=data.copy(deep=False) # shallow copy: the new object shares data with the original
iris2=data.copy(deep=True) # deep copy: the new object gets its own copy of the data
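
A small sketch of what a deep copy guarantees, using a made-up one-column DataFrame. It also shows why `deep` should be the boolean `True` or `False` rather than a quoted string: any non-empty string, including `'False'`, counts as true in Python. How a shallow copy reacts to later writes depends on the pandas version and its copy-on-write settings, so only the deep copy is demonstrated here.

```python
import pandas as pd

original = pd.DataFrame({'a': [1, 2, 3]})

deep = original.copy(deep=True)  # independent copy of the data
deep.loc[0, 'a'] = 99            # change the copy only

print(original.loc[0, 'a'])      # the original is unchanged
print(deep.loc[0, 'a'])

string_flag = bool('False')      # a non-empty string is truthy
print(string_flag)
```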

### Data Types
There are two main types of data: numeric and character types.
Columns of strings are reported with the object dtype in pandas.

In [86]:
data.dtypes # to know the data type of each column

SepalLengthCm     object
SepalWidthCm     float64
PetalLengthCm     object
PetalWidthCm     float64
Species           object
dtype: object

In [87]:
data.select_dtypes(exclude=[object])# exclude columns with 'object' data type

     SepalWidthCm  PetalWidthCm
1             3.5           0.2
2             NaN           0.2
3             3.2           0.2
4             3.1           0.2
5             3.6           0.2
..            ...           ...
146           3.0           2.3
147           2.5           1.9
148           3.0           2.0
149           3.4           2.3


In [88]:
# Print a concise summary of a DataFrame.
#This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
# Syntax:  DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    object 
 1   SepalWidthCm   149 non-null    float64
 2   PetalLengthCm  149 non-null    object 
 3   PetalWidthCm   150 non-null    float64
 4   Species        149 non-null    object 
dtypes: float64(2), object(3)
memory usage: 12.0+ KB


In [89]:
#print(np.unique(data['Species'])) i.e. unique values in Species column
data['Species'].unique()

array(['Iris-setosa', nan, 'Iris-versicolor', 'Iris-virginica'],
      dtype=object)

In [90]:
# value_counts() returns a Series containing the count of each unique value in the column.
# Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
data['Species'].value_counts() # print the count of each species

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        49
Name: Species, dtype: int64

The missing values in this dataset appear as blank cells, '??', and '###'.

By default, pandas replaces blank values with NaN.
#### convert '??' and '###' values to NaN

In [91]:
data.isnull().sum()

SepalLengthCm    0
SepalWidthCm     1
PetalLengthCm    1
PetalWidthCm     0
Species          1
dtype: int64

In [92]:
data['SepalLengthCm'][0:5]# check the value before using na_values

1    5.1
2    4.9
3    4.7
4     ??
5      5
Name: SepalLengthCm, dtype: object

In [93]:
data=pd.read_csv('C:/Users/Swift 3/AppData/Local/Programs/Python/Python38/Iris_data_sample.csv',index_col=0,na_values=['??','###'])

In [94]:
data['SepalLengthCm'][0:5] # check the values after using na_values

1    5.1
2    4.9
3    4.7
4    NaN
5    5.0
Name: SepalLengthCm, dtype: float64

In [95]:
data.info()# observe data type before and after using na_values attribute

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  149 non-null    float64
 1   SepalWidthCm   149 non-null    float64
 2   PetalLengthCm  148 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


In [96]:
data.isnull().sum()# count no of missing values in each column

SepalLengthCm    1
SepalWidthCm     1
PetalLengthCm    2
PetalWidthCm     0
Species          1
dtype: int64
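
The same effect can be reproduced without the Iris file. The sketch below reads a tiny in-memory CSV (invented for this example, with an assumed `id` column as the index) with and without `na_values`, and also shows `pd.to_numeric(..., errors='coerce')` as an alternative for a column that has already been loaded as text.

```python
import io
import pandas as pd

# Hypothetical CSV text that uses '??' and '###' as missing-value markers
csv_text = """id,SepalLengthCm,PetalLengthCm
1,5.1,1.4
2,??,1.5
3,5.0,###
"""

raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
clean = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=['??', '###'])

print(raw.dtypes)            # the markers force both columns to a text dtype
print(clean.dtypes)          # with na_values, both columns are numeric
print(clean.isnull().sum())  # one missing value per column

# Alternative for data that is already loaded: unparseable entries become NaN
coerced = pd.to_numeric(raw['SepalLengthCm'], errors='coerce')
print(coerced)
```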

In [97]:
# reading excel file using Pandas
# reading excel files needs an extra package: older pandas versions use xlrd ('pip install xlrd'),
# while recent versions read .xlsx files with openpyxl ('pip install openpyxl')
iris_excel=pd.read_excel('C:/Users/Swift 3/AppData/Local/Programs/Python/Python38/Iris_data_sample.xlsx')