# **Pandas Objects Tutorial**

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.



In [3]:
import numpy as np
import pandas as pd

**Object Creation:**


*   Series.
*   DataFrame.



**What's a Series?**

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:

*   a Python dict
*   an ndarray (NumPy's n-dimensional array object, which stores a collection of elements of the same type)
*   a scalar value

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

**From ndarray**

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [5]:
a = [4, 2, 3, 4]
s = pd.Series(a)
print(s)
print(type(s))

0    4
1    2
2    3
3    4
dtype: int64
<class 'pandas.core.series.Series'>
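
The length rule for ndarray input can be checked with a small sketch. The names `data`, `default_idx`, and `labeled` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

data = np.array([4, 2, 3, 4])

# No index passed: pandas creates the labels 0 .. len(data) - 1
default_idx = pd.Series(data)
print(list(default_idx.index))  # [0, 1, 2, 3]

# Explicit index of the same length as the data
labeled = pd.Series(data, index=["w", "x", "y", "z"])
print(labeled["y"])  # 3

# An index of a different length is rejected with a ValueError
try:
    pd.Series(data, index=["w", "x"])
except ValueError as err:
    print("length mismatch:", err)
```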


*Notice* that in the code below the list `a` holds items of different types, so the dtype of `s` is `object`.

A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. Among other things, it describes the type of the data (integer, float, Python object, etc.).

In [None]:
a = [1, 6, 8, "mira"]
s = pd.Series(a)
print(s)
print(type(s))

0       1
1       6
2       8
3    mira
dtype: object
<class 'pandas.core.series.Series'>
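
As a rough illustration of how the dtype follows from the input, the sketch below builds three small Series from Python lists and prints their dtypes. The variable names are illustrative only.

```python
import pandas as pd

# Homogeneous integers: pandas infers a 64-bit integer dtype
ints = pd.Series([4, 2, 3, 4])

# Integers mixed with a string: pandas falls back to the generic object dtype
mixed = pd.Series([1, 6, 8, "mira"])

# Integers mixed with a float are upcast to float64
floats = pd.Series([1, 2, 3.5])

print(ints.dtype)    # int64
print(mixed.dtype)   # object
print(floats.dtype)  # float64
```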


In [9]:
# Five draws from a standard normal distribution; the values differ on every run
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s

a   -0.641288
b    0.387155
c   -1.231523
d   -0.352819
e    1.042302
dtype: float64

In [None]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Note:
pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
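
A minimal sketch of this behaviour, using a hypothetical Series `dup` in which the label "e" appears twice. Label lookup still works, while `reindex` is one operation that requires unique labels and fails only when it is called.

```python
import pandas as pd

# Hypothetical example data: the label "e" appears twice
dup = pd.Series([10, 20, 30, 40], index=["a", "b", "e", "e"])

print(dup.index.is_unique)  # False

# Label lookup works, but returns every row carrying that label
print(dup.loc["e"])

# reindex needs unique labels, so the error appears only at this point
try:
    dup.reindex(["a", "e"])
except ValueError as err:
    print("reindex failed:", err)
```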

**Series initialized from dicts:**

In [11]:
d = {"b": 1, "a": 0, "c": 2}

pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [12]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
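
The two dict cases can be compared side by side in a short sketch. It reuses the same dict `d`; the names `plain` and `aligned` are illustrative.

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Without an index, the Series follows the dict's insertion order
plain = pd.Series(d)
print(list(plain.index))  # ['b', 'a', 'c']

# With an index, values are looked up by label; "d" is not a key of the dict
aligned = pd.Series(d, index=["b", "c", "d", "a"])
print(aligned.isna().tolist())  # [False, False, True, False]

# The missing entry is NaN, which is a float, so the whole Series becomes float64
print(aligned.dtype)  # float64
```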

**Series initialized from a scalar value:**

In [13]:
pd.Series("fpoiofwoel", index=["a", "b", "c", "d", "e"])

a    fpoiofwoel
b    fpoiofwoel
c    fpoiofwoel
d    fpoiofwoel
e    fpoiofwoel
dtype: object
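
A small sketch of the scalar case, with illustrative names: the scalar is repeated once per index label, and without an index the result has a single element.

```python
import pandas as pd

# The scalar is repeated to match the length of the index
repeated = pd.Series(5.0, index=["a", "b", "c"])
print(repeated.tolist())  # [5.0, 5.0, 5.0]

# Without an index, the result is a one-element Series labeled 0
single = pd.Series(5.0)
print(len(single))  # 1
```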

**Series is ndarray-like**

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [14]:
s.iloc[0]  # first element by position


-0.6412876701114522

In [None]:
s[:3]

a    2.983900
b    1.333895
c    0.471450
dtype: float64

In [None]:
s[s > s.median()]

a    2.983900
b    1.333895
dtype: float64

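In [16]:
# .iloc selects by position; a bare s[[4, 3, 1]] raised KeyError because the index holds no integer labels
s.iloc[[4, 3, 1]]

e    1.042302
d   -0.352819
b    0.387155
dtype: float64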
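
Because the cells above use random values, a deterministic sketch may be easier to follow. It assumes a hypothetical Series `s_demo` with fixed values and shows positional access with `.iloc`, label access with `.loc`, and a NumPy function applied to the whole Series.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the results are reproducible
s_demo = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0], index=["a", "b", "c", "d", "e"])

print(s_demo.iloc[0])    # by position -> 1.0
print(s_demo.loc["a"])   # by label    -> 1.0

# A list of positions returns the rows in the requested order: e, d, b
print(s_demo.iloc[[4, 3, 1]])

# NumPy ufuncs accept a Series and return a Series with the same index
roots = np.sqrt(s_demo)
print(roots)

# to_numpy() drops the labels and returns a plain ndarray
print(s_demo.to_numpy())
```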

**Vectorized operations and label alignment with Series**

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [None]:
s + s

a    5.967800
b    2.667790
c    0.942899
d    0.576990
e   -1.600588
dtype: float64

In [None]:
s * 2

a    5.967800
b    2.667790
c    0.942899
d    0.576990
e   -1.600588
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

In [None]:
s[1:] + s[:-1]

a         NaN
b    2.667790
c    0.942899
d    0.576990
e         NaN
dtype: float64
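
The alignment rule can also be seen with fixed values. The sketch below uses two hypothetical Series, `left` and `right`, whose indexes only partly overlap. The `fill_value` argument of `Series.add` is not covered in this tutorial and is shown only as one possible way to avoid NaN.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The result carries the union of both indexes; unmatched labels become NaN
total = left + right
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64

# Series.add with fill_value treats a label missing on one side as 0
filled = left.add(right, fill_value=0)
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]
```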

In [19]:
s.iloc[-1]  # last element by position; only the final expression of a cell is displayed
s

a   -0.641288
b    0.387155
c   -1.231523
d   -0.352819
e    1.042302
dtype: float64

**What's a DataFrame?**

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

*   Dict of 1D ndarrays, lists, dicts, or Series
*   2-D numpy.ndarray
*   Structured or record ndarray
*   A Series
*   Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

In [21]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0,5.0], index=["a", "b", "c","d"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "e"]),
}
df = pd.DataFrame(d)

df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  5.0  NaN
e  NaN  4.0


In [22]:
pd.DataFrame(d, index=["d", "b", "a"])

   one  two
d  5.0  NaN
b  2.0  2.0
a  1.0  1.0


In [23]:
# Column labels are case-sensitive: "One" does not match the key "one", so that column is all NaN
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "One"])

   two  One
d  NaN  NaN
b  2.0  NaN
a  1.0  NaN


In [None]:
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [None]:
df.columns

Index(['one', 'two'], dtype='object')
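
The cells above only build a DataFrame from a dict of Series. As a sketch of some of the other inputs listed earlier, the example below uses a 2-D ndarray, a named Series, and an existing DataFrame. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: rows and columns get the labels passed in
arr = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
from_array = pd.DataFrame(arr, index=["x", "y", "z"], columns=["one", "two"])
print(from_array)

# A named Series becomes a single-column DataFrame
ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"], name="one")
from_series = pd.DataFrame(ser)
print(list(from_series.columns))  # ['one']

# Another DataFrame: index and columns select and reorder the existing labels
subset = pd.DataFrame(from_array, index=["z", "x"], columns=["two"])
print(subset)
```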

**Column selection, addition, deletion**

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [None]:
df["one"]


a    1.0
b    2.0
c    3.0
d    5.0
e    NaN
Name: one, dtype: float64

In [None]:
df["three"] = df["one"] * df["two"]
df["three"]

a    1.0
b    4.0
c    9.0
d    NaN
e    NaN
Name: three, dtype: float64

In [None]:
df["flag"] = df["one"] > 2
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  5.0  NaN    NaN   True
e  NaN  4.0    NaN  False


In [None]:
del df["two"]

df

   one  three   flag
a  1.0    1.0  False
b  2.0    4.0  False
c  3.0    9.0   True
d  5.0    NaN   True
e  NaN    NaN  False


When inserting a scalar value, it will naturally be propagated to fill the column:

In [None]:
df["foo"] = "bar"
df

   one  three   flag  foo
a  1.0    1.0  False  bar
b  2.0    4.0  False  bar
c  3.0    9.0   True  bar
d  5.0    NaN   True  bar
e  NaN    NaN  False  bar


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [None]:
df["one_trunc"] = df["one"][:2]

df

   one  three   flag  foo  one_trunc
a  1.0    1.0  False  bar        1.0
b  2.0    4.0  False  bar        2.0
c  3.0    9.0   True  bar        NaN
d  5.0    NaN   True  bar        NaN
e  NaN    NaN  False  bar        NaN
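
The dict analogy extends to a few DataFrame methods that this tutorial does not otherwise cover. The sketch below, on a hypothetical `frame`, uses `pop` to remove a column and get it back, `insert` to place a column at a chosen position, and a truncated Series to show index alignment on assignment.

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# pop removes a column and returns it as a Series, like dict.pop
two = frame.pop("two")
print(two.tolist())  # [4.0, 5.0, 6.0]

# insert places a new column at the given position; the scalar fills every row
frame.insert(0, "zero", 0.0)

# A shorter Series is aligned on the index; rows without a match become NaN
frame["one_trunc"] = frame["one"].iloc[:2]

print(list(frame.columns))                 # ['zero', 'one', 'one_trunc']
print(frame["one_trunc"].isna().tolist())  # [False, False, True]
```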


Creating a DataFrame by passing a NumPy array, with a datetime index using date_range() and labeled columns:

In [None]:
dates = pd.date_range("20130101", periods=6)

dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df

                   A         B         C         D
2013-01-01 -1.114589  0.575136 -1.184577  2.249573
2013-01-02  1.389350  1.775633  0.344217  1.297710
2013-01-03 -0.109973 -2.684794  0.738659  0.140437
2013-01-04 -1.087432  0.739577  0.692542 -1.301616
2013-01-05  0.031542  1.207686  0.805576  2.159795
2013-01-06 -1.008846  0.085557  0.193564 -0.007296
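
Since `np.random.randn` produces new numbers on every run, a seeded generator is one way to make such a DataFrame reproducible. The sketch below assumes NumPy's `default_rng` as a stand-in for `randn`; the seed value and the name `df_seeded` are arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# A seeded generator yields the same numbers on every run
rng = np.random.default_rng(seed=0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Rebuilding with the same seed gives an identical DataFrame
again = pd.DataFrame(
    np.random.default_rng(seed=0).standard_normal((6, 4)),
    index=dates,
    columns=list("ABCD"),
)
print(df_seeded.equals(again))  # True

# Rows can be looked up by their date label
print(df_seeded.loc["2013-01-02"])
```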


Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [None]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
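
A short follow-up sketch on the dict-of-lists case: each list becomes a column with its own dtype, an optional index replaces the default 0..n-1 labels, and lists of unequal length are rejected. The row labels `day1` to `day3` are made up for illustration.

```python
import pandas as pd

mydataset = {
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, 7, 2],
}

# Each column keeps its own dtype; custom row labels replace 0, 1, 2
labeled = pd.DataFrame(mydataset, index=["day1", "day2", "day3"])
print(labeled.dtypes)
print(labeled.loc["day2", "cars"])  # Volvo

# All lists must have the same length, otherwise a ValueError is raised
try:
    pd.DataFrame({"cars": ["BMW", "Volvo"], "passings": [3, 7, 2]})
except ValueError as err:
    print("cannot build DataFrame:", err)
```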
