
In [34]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
from skimage.io import imread

# Difference between ndarray and series object
# Difference between Numpy and Pandas

There are some differences worth noting between ndarrays and Series objects. First, elements in NumPy arrays are accessed by their integer position, starting with zero for the first element. A pandas Series is more flexible because you can define your own labeled index and use it to access elements. The labels can be letters instead of numbers, or numbers in descending rather than ascending order. Second, Series objects align data by label, which makes combining data from different Series, including data with missing values, more convenient than with ndarrays. If a label has no match during alignment, pandas returns NaN (Not a Number) for that label so that the operation does not fail.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
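The contrast between positional and label-based behavior can be illustrated with a minimal sketch. The array values and the labels 'w' through 'z' are made up for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Two plain ndarrays are combined strictly by integer position.
arr_a = np.array([1, 2, 3])
arr_b = np.array([10, 20, 30])
positional_sum = arr_a + arr_b

# Two Series are combined by label, regardless of position.
# The labels below are illustrative only.
s_a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s_b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])
aligned_sum = s_a + s_b  # labels found in only one Series yield NaN

print(positional_sum)
print(aligned_sum)
```

Only 'y' and 'z' appear in both Series, so only those two labels get a numeric result; 'w' and 'x' come back as NaN instead of raising an error.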

In [35]:
# Series
obj = pd.Series([1,-2,7,5,3])
obj

0    1
1   -2
2    7
3    5
4    3
dtype: int64

In [36]:
obj.values

array([ 1, -2,  7,  5,  3])

In [37]:
obj.index

RangeIndex(start=0, stop=5, step=1)

In [38]:
# Pandas object with custom index
obj2 = pd.Series([1,-2,7,5,3], ['a','b','c','d','e'])
obj2

a    1
b   -2
c    7
d    5
e    3
dtype: int64

In [39]:
obj2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [40]:
obj2.values

array([ 1, -2,  7,  5,  3])

In [41]:
# Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains
# strings instead of integers.
obj2[['c', 'a', 'd']]

c    7
a    1
d    5
dtype: int64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

In [42]:
'f' in obj2

False

In [43]:
'c' in obj2

True
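A few more dict-like operations, as a rough sketch. The Series `s` below is a small stand-in defined only for this example.

```python
import pandas as pd

# Illustrative Series; not one of the objects defined in the cells above.
s = pd.Series([1, -2, 7], index=['a', 'b', 'c'])

has_label = 'c' in s        # membership tests the index labels
has_value = 7 in s          # ...not the data values, so this is False
fallback = s.get('z', 0)    # like dict.get: default for an absent label
labels = list(s.keys())     # keys() returns the index
as_dict = s.to_dict()       # convert back to a plain dict

print(has_label, has_value, fallback, labels, as_dict)
```

Note that `in` checks labels only; to test for a data value, one option is `7 in s.values`.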

In [44]:
# Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

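When you pass only a dict, the index of the resulting Series follows the order in which the keys were inserted into the dict, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dict keys in the order you want them to appear in the resulting Series. A label with no entry in the dict, such as 'California' below, gets a NaN value: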

In [45]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [46]:
# The isnull and notnull functions in pandas should be used to detect missing data:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [47]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
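The same checks are also available as Series methods. The sketch below uses a small made-up Series (`populations` is a hypothetical name) to count missing entries and keep only the present ones.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing entry.
populations = pd.Series({'California': np.nan, 'Ohio': 35000.0, 'Texas': 71000.0})

missing_mask = populations.isna()            # same result as pd.isnull(populations)
n_missing = int(missing_mask.sum())          # True counts as 1
present = populations[populations.notna()]   # boolean mask keeps non-missing rows

print(n_missing)
print(present)
```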

In [48]:
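# obj3 is not defined in any earlier cell of this notebook. The output below is
# consistent with obj3 having the same index as obj4, so it is recreated here
# under that assumption.
obj3 = pd.Series(sdata, index=states)
obj3 + obj4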

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
dtype: float64

A Series’s index can be altered in-place by assignment:

In [49]:
print(obj)

0    1
1   -2
2    7
3    5
4    3
dtype: int64
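The index assignment itself is not shown in the cell above, so here is a minimal sketch of it. The Series `s` and the letter labels are illustrative choices.

```python
import pandas as pd

s = pd.Series([1, -2, 7, 5, 3])

# Replace the default RangeIndex in place; the new index must have
# exactly as many labels as the Series has values.
s.index = ['a', 'b', 'c', 'd', 'e']
print(s)

# A label list of the wrong length is rejected.
try:
    s.index = ['only', 'two']
    length_error = None
except ValueError as exc:
    length_error = exc
print(length_error)
```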



# DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
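To make the "dict of Series sharing the same index" picture concrete, here is a small sketch. The row labels 'r1' and 'r2' and the column `is_large` are invented for the example.

```python
import pandas as pd

# Three Series with different value types but the same index.
year = pd.Series([2000, 2001], index=['r1', 'r2'])
state = pd.Series(['Ohio', 'Nevada'], index=['r1', 'r2'])
is_large = pd.Series([True, False], index=['r1', 'r2'])

# Each dict key becomes a column; the shared labels become the row index.
df = pd.DataFrame({'year': year, 'state': state, 'is_large': is_large})

print(df.dtypes)     # each column keeps its own dtype
print(df['state'])   # selecting a column returns a Series
```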

In [56]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [57]:
## For large DataFrames, the head method selects only the first five rows:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [59]:
## If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [66]:
## If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'dept'], index=['one','two','three','four','five','six'])

#Filling the NaN columns (more on this later...)

In [None]:
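# Columns can be modified by assignment: a column can be assigned a scalar value or an array of values.
# Note that frame2 was created with a column spelled 'dept', so assigning to 'debt' below adds a
# new column rather than filling the empty one; the 'dept' column is filled in the cell after that.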

In [67]:
frame2['debt'] = 16.5
frame2

       year   state  pop  dept  debt
one    2000    Ohio  1.5   NaN  16.5
two    2001    Ohio  1.7   NaN  16.5
three  2002    Ohio  3.6   NaN  16.5
four   2001  Nevada  2.4   NaN  16.5
five   2002  Nevada  2.9   NaN  16.5
six    2003  Nevada  3.2   NaN  16.5


In [68]:
frame2['dept'] = np.arange(6.)
frame2

       year   state  pop  dept  debt
one    2000    Ohio  1.5   0.0  16.5
two    2001    Ohio  1.7   1.0  16.5
three  2002    Ohio  3.6   2.0  16.5
four   2001  Nevada  2.4   3.0  16.5
five   2002  Nevada  2.9   4.0  16.5
six    2003  Nevada  3.2   5.0  16.5


In [70]:
#When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. 
#If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

       year   state  pop  dept  debt
one    2000    Ohio  1.5   0.0   NaN
two    2001    Ohio  1.7   1.0  -1.2
three  2002    Ohio  3.6   2.0   NaN
four   2001  Nevada  2.4   3.0  -1.5
five   2002  Nevada  2.9   4.0  -1.7
six    2003  Nevada  3.2   5.0   NaN


In [71]:
# Assigning to a column that doesn't exist creates a new column.
# The del keyword deletes columns, as with a dict. As an example of del,
# first add a new column of boolean values where the state column equals 'Ohio'.
# The comparison is case-sensitive, so the literal must be 'Ohio', not 'ohio':
frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year   state  pop  dept  debt  eastern
one    2000    Ohio  1.5   0.0   NaN     True
two    2001    Ohio  1.7   1.0  -1.2     True
three  2002    Ohio  3.6   2.0   NaN     True
four   2001  Nevada  2.4   3.0  -1.5    False
five   2002  Nevada  2.9   4.0  -1.7    False
six    2003  Nevada  3.2   5.0   NaN    False



Note: New columns cannot be created with the frame2.eastern attribute syntax; use frame2['eastern'] = ... instead.
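A small sketch of what the note means in practice. The DataFrame `df` is a stand-in for this example, and the warning filter is there only because pandas may emit a UserWarning for this pattern.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada']})

# Attribute assignment sets a plain Python attribute on the object.
# It does not add a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the pandas warning for this demo
    df.eastern = df['state'] == 'Ohio'
created_by_attribute = 'eastern' in df.columns

# Bracket assignment is what actually creates the column.
df['eastern'] = df['state'] == 'Ohio'
created_by_brackets = 'eastern' in df.columns

print(created_by_attribute, created_by_brackets)
```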

In [72]:
# The del statement can then be used to remove this column:
del frame2['eastern']
frame2

       year   state  pop  dept  debt
one    2000    Ohio  1.5   0.0   NaN
two    2001    Ohio  1.7   1.0  -1.2
three  2002    Ohio  3.6   2.0   NaN
four   2001  Nevada  2.4   3.0  -1.5
five   2002  Nevada  2.9   4.0  -1.7
six    2003  Nevada  3.2   5.0   NaN


Another common form of data is a nested dict of dicts. If the nested dict is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row indices:

In [73]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [74]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [75]:
pd.DataFrame(pop, index = [2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


# Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [76]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [78]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can't be modified by the user.
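Immutability can be checked directly: item assignment on an Index raises a TypeError. A minimal sketch, using a fresh illustrative Index:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Attempting to change a single label in place is rejected.
try:
    index[1] = 'd'
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

print(type(mutation_error).__name__)
print(index)  # unchanged
```

Because an Index cannot change underneath you, it is safe to share one Index object among several Series or DataFrames.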

In [80]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [81]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [83]:
obj2.index is labels

True

NOTE: Unlike Python sets, a pandas Index can contain duplicate labels:

In [84]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
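One consequence of duplicate labels, sketched with made-up values: selecting a repeated label returns every matching entry rather than a single scalar.

```python
import pandas as pd

# Illustrative Series whose index repeats each label twice.
s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'bar'])

foo_values = s['foo']           # a Series with both 'foo' entries
is_unique = s.index.is_unique   # False when any label repeats

print(foo_values)
print(is_unique)
```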

# Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [85]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [86]:
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option of reindex allows this; for example, method='ffill' forward-fills each new label with the value of the last existing label before it.

In [87]:
obj3 = pd.Series(['blue','yellow','red'], ['one','two','three'])
obj3

one        blue
two      yellow
three       red
dtype: object
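The cell above does not call reindex, so here is a minimal sketch of forward-filling. It assumes an integer index with gaps (0, 2, 4), chosen for illustration, since method='ffill' needs an ordered index.

```python
import pandas as pd

# Illustrative Series with gaps in its integer index.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Reindex onto 0..5; each new label takes the last preceding value.
filled = colors.reindex(range(6), method='ffill')
print(filled)
```

Without the method argument, the new labels 1, 3, and 5 would hold NaN instead.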

In [101]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
 index=['a', 'c', 'd'],
 columns=['Ohio', 'Texas', 'California'])

frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [93]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [94]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


# Dropping Entries from an Axis

In [95]:
# The drop method returns a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [97]:
obj.drop('c')

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [99]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [102]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
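Dropping from either axis of this DataFrame might look like the following sketch. The result variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Dropping by row label (axis 0 is the default).
rows_dropped = data.drop(['Colorado', 'Ohio'])

# Dropping columns: pass axis=1, or use the columns keyword.
one_col_dropped = data.drop('two', axis=1)
two_cols_dropped = data.drop(columns=['two', 'four'])

print(rows_dropped)
print(one_col_dropped)
print(two_cols_dropped)
```

Each call returns a new DataFrame; `data` itself is left unchanged.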
