# <font color='289C4E'>Data structures</font><a class='anchor' id='top'></a>
### <font color='blue'>Series</font><a class='anchor' id='top'></a>
- [Array-like](#1)
- [Dictionary](#2)
- [Scalar](#3)
- [Series is similar to array](#4)
- [Series is similar to dictionary](#5)
- [Name attribute](#6)
### <font color='blue'>DataFrame</font><a class='anchor' id='top'></a>
- [From dict of Series or dicts](#1)
- [From dict of array-likes](#2)
- [From a list of dicts](#3)
- [From a dict of tuples](#4)
- [From a Series](#5)

pandas is built around two basic data structures: Series and DataFrame. Older releases also included a three-dimensional Panel, which has since been removed from the library. There are extensions to this list, but for the purposes of this material the first two are more than enough.

We start by installing NumPy and pandas (if they are not already available) and then importing them under their conventional short names:



In [1]:
pip install numpy




In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np

In [41]:
import pandas as pd

## <font color='blue'>Series</font><a class='anchor' id='top'></a>
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

    s = pd.Series(data, index=index)

The data argument can be:

- array-like
- dictionary
- scalar

In [5]:
marks = ['91', '82', '93', '94']
name = ["hari", "Givina", "Shyam", "Gita"]
# Create a Series of marks labeled by student name
# (the marks are strings, so the resulting dtype is object)
student = pd.Series(marks, index=name)
print(student)

hari      91
Givina    82
Shyam     93
Gita      94
dtype: object


### Array-like
If data is an array-like, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
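As a small sketch of the two rules above (the names `values`, `default_idx`, and `mismatch_raised` are illustrative and not part of the notebook): omitting the index yields integer labels starting at 0, while an index whose length differs from the data is rejected with a `ValueError`.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# No index passed: pandas creates the labels 0, 1, 2
default_idx = pd.Series(values)
print(list(default_idx.index))

# Index shorter than the data: pandas raises ValueError
mismatch_raised = False
try:
    pd.Series(values, index=['x', 'y'])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```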

In [6]:
# Short alias for NumPy's standard-normal random number generator
randn = np.random.randn

In [7]:
# Create a Series of five random numbers with custom index labels
s = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -1.263560
b    1.081665
c   -1.491143
d    0.775689
e    1.481738
dtype: float64

In [8]:
# Show the index labels together with their dtype
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [9]:
# Without an explicit index, a default integer index 0..4 is created
pd.Series(randn(5))

0    0.424525
1   -0.919215
2   -0.251130
3    0.027464
4   -1.292710
dtype: float64

### Dictionary
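A dictionary's keys are a natural candidate for the index, so passing an index separately is usually redundant, although possible. When an index is passed, values are looked up in the dictionary by label, and labels missing from the dictionary get NaN.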



In [10]:
# Create a Series from a dictionary; the keys become the index
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [11]:
# An explicit index selects and reorders the dict entries; the missing label 'd' becomes NaN
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

### Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.



In [12]:
# Creating a Pandas Series with scalar data (6.0) and custom index labels
pd.Series(6., index=['a', 'b', 'c', 'd', 'e'])

a    6.0
b    6.0
c    6.0
d    6.0
e    6.0
dtype: float64

### Series is similar to array
Slicing and other operations on a Series produce results very similar to those on an array, with one twist: the index is sliced along with the values and always remains part of the data container.

In [13]:
# Select the second element by position
s.iloc[1]

1.081665386978438

In [14]:
# Slice the first four elements; their labels are kept
s[:4] 

a   -1.263560
b    1.081665
c   -1.491143
d    0.775689
dtype: float64

In [15]:
# Boolean mask: keep only the values above the median
s[s > s.median()]

b    1.081665
e    1.481738
dtype: float64

In [16]:
# Select several elements by a list of positions, in the order given
s.iloc[[4, 3, 1]]

e    1.481738
d    0.775689
b    1.081665
dtype: float64

Like NumPy arrays, Series support vectorized operations, which replace explicit Python loops with a single expression over the whole container.

In [17]:
s + s

a   -2.527120
b    2.163331
c   -2.982285
d    1.551378
e    2.963476
dtype: float64

In [18]:
s * 2

a   -2.527120
b    2.163331
c   -2.982285
d    1.551378
e    2.963476
dtype: float64

In [19]:
# Taking the exponential (e^x) of each element in the Pandas Series using np.exp()
np.exp(s)

a    0.282646
b    2.949588
c    0.225115
d    2.172088
e    4.400588
dtype: float64
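To make the link between vectorization and loops concrete, here is a minimal sketch that computes the same result both ways. The Series `prices` and the names `looped` and `vectorized` are made up for illustration; no timing is claimed here, only that the two forms agree.

```python
import pandas as pd

prices = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])

# Explicit loop: one Python-level multiplication per element
looped = pd.Series({label: value * 2 for label, value in prices.items()})

# Vectorized: a single expression applied to the whole Series
vectorized = prices * 2

# Both approaches give the same labels and values
print((looped == vectorized).all())
```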

A key difference between Series and array is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [20]:
s[1:] + s[:-1]

a         NaN
b    2.163331
c   -2.982285
d    1.551378
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
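The alignment rule can be sketched with two Series that share only one label. The data and the names `q1`, `q2`, `total`, and `filled` are illustrative assumptions. `Series.add` with `fill_value` is shown as one way to treat a label missing from one side as 0 rather than propagating NaN.

```python
import pandas as pd

q1 = pd.Series([10.0, 20.0], index=['apples', 'pears'])
q2 = pd.Series([1.0, 2.0], index=['pears', 'plums'])

# Plain addition: labels found in only one Series become NaN
total = q1 + q2
print(total)

# Series.add with fill_value substitutes 0 for a label missing on one side
filled = q1.add(q2, fill_value=0)
print(filled)
```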

### Series is similar to dictionary
A Series behaves like a fixed-size dict: values can be read and set by index label, and the in operator tests whether a label is present.

In [21]:
# Look up a value by its index label
s['a']

-1.2635602234951806

In [22]:
# Assign a new value to the label 'e'
s['e'] = 12.

In [23]:
s

a    -1.263560
b     1.081665
c    -1.491143
d     0.775689
e    12.000000
dtype: float64

In [24]:
# Check whether a label is present in the index
'e' in s

True

In [25]:
'f' in s

False
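The dict analogy extends to missing labels. The sketch below uses a made-up Series `ages`: indexing with an absent label raises `KeyError`, while `Series.get` returns a default instead.

```python
import pandas as pd

ages = pd.Series([31, 45], index=['ann', 'bob'])

# Indexing with a missing label raises KeyError, as with a dict
try:
    ages['cat']
except KeyError:
    print("no label 'cat'")

# Series.get returns the supplied default instead of raising
fallback = ages.get('cat', -1)
print(fallback)
```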

### Name attribute
Series can also have a name attribute which will become very useful when summarizing data with tables and plots.



In [26]:
# Create a Series with a name attribute
s = pd.Series(np.random.randn(5),index=["a","b","c","d","e"], name='random series')
s

a   -1.194840
b    0.957368
c   -0.740591
d   -0.146684
e    1.061110
Name: random series, dtype: float64

In [27]:
# Show only the name attribute
s.name

'random series'
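One place where the name shows up is conversion to a table. In this hedged sketch (the Series `temps` and its labels are invented for illustration), `rename` with a scalar returns a copy under a new name, and `to_frame` uses the name as the column label.

```python
import pandas as pd

temps = pd.Series([21.5, 19.0], index=['mon', 'tue'], name='celsius')

# rename with a scalar returns a new Series carrying a different name
renamed = temps.rename('temperature')
print(temps.name, renamed.name)

# The name becomes the column label when converting to a DataFrame
frame = temps.to_frame()
print(list(frame.columns))
```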

## <font color='blue'>DataFrame</font><a class='anchor' id='top'></a>
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
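The 2-D ndarray input is a convenient way to see both behaviours side by side. This is a minimal sketch with an invented array `grid`: without labels, rows and columns are numbered from 0, and with `index` and `columns` the labels are exactly those passed.

```python
import numpy as np
import pandas as pd

grid = np.arange(6).reshape(2, 3)

# No labels passed: rows and columns are both numbered from 0
unlabeled = pd.DataFrame(grid)
print(unlabeled)

# Explicit row and column labels
labeled = pd.DataFrame(grid, index=['r1', 'r2'], columns=['x', 'y', 'z'])
print(labeled)
```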

### From dict of Series
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns follow the order of the dict keys (older pandas releases sorted them instead).
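The nested-dict case can be sketched as follows. The dict `nested` is a made-up example: each inner dict is treated like a Series, the row index is the union of the inner keys, and cells with no value are NaN.

```python
import pandas as pd

nested = {
    'one': {'a': 1.0, 'b': 2.0},
    'two': {'b': 20.0, 'c': 30.0},
}

# Outer keys become columns; inner keys are unioned into the row index
frame = pd.DataFrame(nested)
print(frame)
```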

In [28]:
# Build a DataFrame with name, age and city columns from a dict of Series
data={
    'name':pd.Series(["ram", "Gopal","Krishna"], index=[1,2,3]),
    'age':pd.Series(["21", "31", "29"],index=[1,2,3]),
    'city':pd.Series(["Kathmandu","Surkhet","Nepaljung"], index=[1,2,3])
}
df=pd.DataFrame(data)
df

      name  age       city
1      ram   21  Kathmandu
2    Gopal   31    Surkhet
3  Krishna   29  Nepaljung


In [29]:
# Series with different indexes: the row index is their union, and gaps become NaN
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])
    }
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [30]:
# Keep only the requested row labels, in the order given
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [31]:
# Keep only the requested rows and a single column
pd.DataFrame(d, index=['a','d'], columns=['two'])

   two
a  1.0
d  4.0


In [32]:
# A requested column that is not in the dict ('three') is filled with NaN
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two  three
d  4.0    NaN
b  2.0    NaN
a  1.0    NaN


The row and column labels are available through the index and columns attributes, respectively:

In [33]:
# Row labels of the DataFrame, with their dtype
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [34]:
# Column labels of the DataFrame, with their dtype
df.columns

Index(['one', 'two'], dtype='object')

### From dict of array-likes
The array-likes must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the resulting index will be range(n), where n is the array length.
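A brief sketch of the length rule, using invented data: lists of equal length get a default integer index, while lists of different lengths are rejected with a `ValueError`. The names `ok` and `length_error` are illustrative.

```python
import pandas as pd

# Equal lengths: the row index defaults to 0, 1
ok = pd.DataFrame({'one': [1., 2.], 'two': [4., 3.]})
print(list(ok.index))

# Unequal lengths: pandas raises ValueError
length_error = None
try:
    pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 3.]})
except ValueError as err:
    length_error = err
    print("ValueError:", err)
```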

In [35]:
# Build a DataFrame from a dictionary of equal-length lists
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [36]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


### From a list of dicts

In [37]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [38]:
pd.DataFrame(data2, index=['first', 'second'])

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [39]:
pd.DataFrame(data2, columns=['a', 'b'])

   a   b
0  1   2
1  5  10


### From a dict of tuples

In [40]:

# Tuple keys produce hierarchical (MultiIndex) row and column labels
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
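The tuple keys in the call above end up as two-level labels on both axes. As a hedged sketch of how such a frame might be inspected, the example below rebuilds a smaller version of it (only the first two columns; the names `nested`, `frame`, `cell`, and `inner` are illustrative) and selects by tuple.

```python
import pandas as pd

nested = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
}
frame = pd.DataFrame(nested)

# Both axes carry two levels of labels
print(frame.columns.nlevels, frame.index.nlevels)

# Select one cell with a (row tuple, column tuple) pair
cell = frame.loc[('A', 'B'), ('a', 'a')]
print(cell)

# Selecting the outer column label 'a' returns its inner columns
inner = frame['a']
print(list(inner.columns))
```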
