### **Pandas**

pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.


### **Advantages**
* Fast and efficient for manipulating and analyzing data.
* Data from different file objects can be loaded.
* Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
* **Size mutability:** columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
* Data set merging and joining.
* Flexible reshaping and pivoting of data sets.
* Provides time-series functionality.
* Powerful group by functionality for performing split-apply-combine operations on data sets.

pandas provides two main data structures for manipulating data:

* Series
* DataFrame

### **Series**
* A Series is a one-dimensional array-like object containing a sequence of values of the same type (similar to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

      Syntax:
      pandas.Series(data, index, dtype, copy)
          data: the values; accepts an ndarray, list, dict, or scalar constant.
          index: the labels; values must be hashable and, for array-like data, the same length as data. Labels need not be unique. Defaults to RangeIndex (0, 1, ..., n-1).
          dtype: data type of the values. If None, the data type is inferred.
          copy: whether to copy the input data. Default False.



In [None]:
import numpy as np
import pandas as pd

In [None]:
#checking pandas version
print(pd.__version__)

1.5.3


In [None]:
#Creating Series from list
l=[1,2,3,4]
ser=pd.Series(l)
print(ser)

0    1
1    2
2    3
3    4
dtype: int64


In [None]:
#You can get the array representation and index object of the Series via its array and index attributes, respectively
ser.array

<PandasArray>
[1, 2, 3, 4]
Length: 4, dtype: int64

In [None]:
ser.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
#Creating series from array
data=np.array(['cracklogic','technocutter','python'])
ser_new=pd.Series(data)
print(ser_new)

0      cracklogic
1    technocutter
2          python
dtype: object


In [None]:
#create a Series with an index identifying each data point with a label
ser_arr=pd.Series([2,4,6,8],index=['A','B','C','D'])
print(ser_arr)

A    2
B    4
C    6
D    8
dtype: int64


In [None]:
#create series from scalar

ser_scaler=pd.Series(5,index=['a','b','c'])
print(ser_scaler)

a    5
b    5
c    5
dtype: int64


In [None]:
print('Index of manually created Series:')
print(ser_arr.index)
print('\n')
print('Series values',ser_arr.array)

Index of manually created Series:
Index(['A', 'B', 'C', 'D'], dtype='object')


Series values <PandasArray>
[2, 4, 6, 8]
Length: 4, dtype: int64


In [None]:
#Accessing an element using its index label
ser_arr['A']

2

In [None]:
#Accessing elements by position using slicing
ser_arr[:2]

A    2
B    4
dtype: int64

In [None]:
np.exp(ser_arr)

A       7.389056
B      54.598150
C     403.428793
D    2980.957987
dtype: float64

In [None]:
'B' in ser_arr

True

In [None]:
#Create series from dictionary
#if you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary

data={'Name':'Nitesh','Education':'BE','YOP':2016}
ser_data=pd.Series(data)
print(ser_data)

Name         Nitesh
Education        BE
YOP            2016
dtype: object


In [None]:
#A Series can be converted back to a dictionary with its to_dict method:
ser_data.to_dict()

{'Name': 'Nitesh', 'Education': 'BE', 'YOP': 2016}

In [None]:
#The isna and notna functions in pandas should be used to detect missing data:
print('isna example')
print(ser_data.isna())
print('*'*20)
print('notna example')
print(ser_data.notna())

isna example
Name         False
Education    False
YOP          False
dtype: bool
********************
notna example
Name         True
Education    True
YOP          True
dtype: bool
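
The Series above has no missing entries, so every `isna` result is False. As a minimal sketch of what the two methods report when data really is missing (the `salaries` dict and `roles` list are made up for illustration), a Series can be built against an index that contains a label the dict lacks:

```python
import pandas as pd

# Hypothetical salary data; "BA" has no entry in the dict.
salaries = {"DS": 200000, "SE": 100000, "DA": 80000}
roles = ["DS", "SE", "DA", "BA"]

# Passing an explicit index aligns the dict to it; labels without a value become NaN.
ser_missing = pd.Series(salaries, index=roles)

missing_mask = ser_missing.isna()            # True only for "BA"
n_missing = int(missing_mask.sum())          # number of missing entries
present = ser_missing[ser_missing.notna()]   # keep only the non-missing entries

print(ser_missing)
print("missing:", n_missing)
```

Because NaN is a floating-point value, the integer salaries are stored as float64 once a missing entry is present.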


In [None]:
ser1=pd.Series({'DS':200000,'SE':100000,'DA':80000,'BA':70000})
ser2=pd.Series({'DS':100000,'SE':90000,'DA':80000,'Tester':70000})

In [None]:
#Arithmetic operations align the two Series by index label, not by position. The result is NaN for any label that is not present in both Series
ser1+ser2

BA             NaN
DA        160000.0
DS        300000.0
SE        190000.0
Tester         NaN
dtype: float64

In [None]:
ser1-ser2

BA             NaN
DA             0.0
DS        100000.0
SE         10000.0
Tester         NaN
dtype: float64
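
If NaN for unmatched labels is not what you want, the method form of the operator accepts a `fill_value`. The sketch below reuses the two Series from above and assumes that treating a missing label as 0 is acceptable for the data at hand:

```python
import pandas as pd

ser1 = pd.Series({"DS": 200000, "SE": 100000, "DA": 80000, "BA": 70000})
ser2 = pd.Series({"DS": 100000, "SE": 90000, "DA": 80000, "Tester": 70000})

# add() is the method form of +. fill_value=0 substitutes 0 for a label
# that is missing from one of the two Series before the addition.
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this call, "BA" and "Tester" keep their single available value instead of becoming NaN.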

In [None]:
ser1.index.name='Designation'

In [None]:
ser1

Designation
DS    200000
SE    100000
DA     80000
BA     70000
dtype: int64

In [None]:
pd.Series([np.nan]).sum()

0.0

### **DataFrame**
* A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). 

* The DataFrame has both a row and column index; it can be thought of
as a dictionary of Series all sharing the same index.

* A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

**Features of DataFrame**
* Columns can be of different types
* Size-mutable
* Labeled axes (rows and columns)
* Can perform arithmetic operations on rows and columns

**Syntax**

      pandas.DataFrame(data, index, columns, dtype, copy)
        data: the input; accepts an ndarray, Series, dict, list, constant, or another DataFrame.
        index: row labels for the resulting frame. Optional; defaults to RangeIndex (0, 1, ..., n-1) if no index is passed.
        columns: column labels. Optional; defaults to RangeIndex (0, 1, ..., n-1) if no column labels are passed.
        dtype: data type to force for the columns. If None, it is inferred.
        copy: whether to copy data from the input.



A pandas DataFrame can be created using various inputs like:
* Lists
* dict
* Series
* Numpy ndarrays
* Another DataFrame

In [None]:
# Creating DataFrame Using List
Advertising=['TV','Radio','NewPaper']
Spend=[230,39,70]

#Wrapping Spend in an outer list makes it a single row, so the number of values must match the number of column names.
data=pd.DataFrame([Spend],columns=Advertising) #set column names using the columns keyword

In [None]:
data

    TV  Radio  NewPaper
0  230     39        70


In [None]:
#Creating DataFrame Using dictionary
data={'Name':['Nitesh', 'Rohit', 'Prashant', 'Arun','Subhash'],'Age':[30,31,28,29,35]}
df=pd.DataFrame(data)
print(df)

       Name  Age
0    Nitesh   30
1     Rohit   31
2  Prashant   28
3      Arun   29
4   Subhash   35


In [None]:
##Creating DataFrame Using Series
data1=pd.Series(['A','B','C','D'])
data2=pd.Series([1,2,3,4])

data={'cols1':data1,'cols2':data2}

df_new=pd.DataFrame(data,columns=['cols1','cols2'])
print(df_new)

  cols1  cols2
0     A      1
1     B      2
2     C      3
3     D      4


In [None]:
#Create a DataFrame from Numpy ndarrays
a=np.array([1,2,3,5])
s=pd.DataFrame(a,columns=['cols1'])
print(s)

   cols1
0      1
1      2
2      3
3      5


In [None]:
# Create a DataFrame from another DataFrame
df = s.copy()
print(df)

   cols1
0      1
1      2
2      3
3      5


In [None]:
df.describe()

          cols1
count  4.000000
mean   2.750000
std    1.707825
min    1.000000
25%    1.750000
50%    2.500000
75%    3.500000
max    5.000000


#### **Reindexing**
An important method on pandas objects is reindex, which means to create a new object with the values rearranged to align with the new index.

In [None]:
data=pd.Series([4, 7.2, -5.3, 3], index=["d", "b", "a", "c"])
data

d    4.0
b    7.2
a   -5.3
c    3.0
dtype: float64

In [None]:
data=data.reindex(['a','b','c','d'])

In [None]:
data

a   -5.3
b    7.2
c    3.0
d    4.0
dtype: float64
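
The example above only reorders existing labels. A minimal sketch of the other common case, where the new index contains a label that was not there before (the label "e" is chosen purely for illustration):

```python
import pandas as pd

data = pd.Series([4, 7.2, -5.3, 3], index=["d", "b", "a", "c"])

# "e" is not in the original index, so its value is NaN after reindexing.
with_nan = data.reindex(["a", "b", "c", "d", "e"])

# fill_value supplies a replacement for labels that did not exist before.
with_zero = data.reindex(["a", "b", "c", "d", "e"], fill_value=0)

print(with_nan)
print(with_zero)
```

In both calls `reindex` returns a new Series; `data` itself still has four entries.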

In [None]:
data = pd.DataFrame(np.arange(9).reshape((3, 3)),index=["a", "c", "d"],columns=["Ohio", "Texas", "California"])
data


   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [None]:
#Accessing values using row labels and column names
data.loc[['a','c'],['California','Ohio']]

   California  Ohio
a           2     0
c           5     3


In [None]:
#Dropping Entries from an Axis
data.drop('California',axis=1,inplace=True) #inplace=True modifies data itself instead of returning a new DataFrame

In [None]:
#drop values from the row labels (axis 0) using index name
data.drop(index=['a'],inplace=True)
print(data)

   Ohio  Texas
c     3      4
d     6      7


In [None]:
#drop a column by passing axis=1 or axis="columns"; without inplace=True this returns a new DataFrame and leaves data unchanged
data.drop(['Ohio'], axis="columns")

   Texas
c      4
d      7


loc selects by label. There is also an iloc operator that indexes exclusively
by integer position, so it works consistently whether or not the index contains integers.

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"],columns=["one", "two", "three", "four"])


In [None]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
#Accessing column using column name
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [None]:
#Accessing more than one column
data[['three','one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [None]:
#Access the first three rows (positions 0 to 2)
data[:3]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11


In [None]:
#Select the rows where the value in column three is greater than 5
data[data['three']>5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
#Assign the value 0 to each location where the condition data<8 is True
data[data<8]=0

In [None]:
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    0      0     0
Utah        8    9     10    11
New York   12   13     14    15


#### **Indexing options with DataFrame**

DataFrame has special attributes **loc** and **iloc** for label-based and
integer-based indexing, respectively.

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

In [None]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.loc['Ohio']

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int64

In [None]:
data.loc[['Ohio','Utah']]

      one  two  three  four
Ohio    0    1      2     3
Utah    8    9     10    11


In [None]:
data.loc[['Ohio','Utah'],['one','three']]

      one  three
Ohio    0      2
Utah    8     10


In [None]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [None]:
data.iloc[[2,1]]

          one  two  three  four
Utah        8    9     10    11
Colorado    4    5      6     7


In [None]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [None]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
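
Both attributes also accept slices, with one difference worth seeing in code: a label slice with `loc` includes its endpoint, while a position slice with `iloc` excludes it. A short sketch on the same frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Label slices are inclusive: "Utah" and "three" are part of the result.
by_label = data.loc["Ohio":"Utah", "one":"three"]

# Position slices follow Python rules: row 2 and column 2 are excluded.
by_position = data.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```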


In [None]:
#A Boolean Series can be used with loc but not with iloc:
data.loc[data.three >= 2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.iloc[data.three >= 2]

ValueError: ignored
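
The error comes from passing a labeled Boolean Series to `iloc`. One possible workaround, sketched here with a threshold of 10 chosen only for illustration, is to convert the mask to a plain NumPy array first, which carries no labels and is therefore purely positional:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# to_numpy() strips the index from the Boolean Series, leaving a positional mask.
mask = (data.three >= 10).to_numpy()
subset = data.iloc[mask]

print(subset)
```

In most cases `data.loc[data.three >= 10]` is the simpler choice; the `iloc` form is mainly useful when the rest of an expression is already positional.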

In [None]:
#With an integer index, ser1[-1] is ambiguous: pandas treats -1 as a label, finds no such label, and raises a KeyError.
ser1 = pd.Series(np.arange(3))
ser1


0    0
1    1
2    2
dtype: int64

In [None]:
ser1[-1]

KeyError: ignored

In [None]:
#with a noninteger index, there is no such ambiguity
ser2 = pd.Series(np.arange(3), index=["a", "b", "c"])
ser2

a    0
b    1
c    2
dtype: int64

In [None]:
ser2[-1]

2

**Note**

Prefer indexing with loc and iloc to avoid this ambiguity.
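
A brief sketch of the explicit forms on the two Series from above. As far as I am aware, newer pandas releases also deprecate the positional fallback that makes `ser2[-1]` work, which is one more reason to spell out the intent:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(np.arange(3))                          # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3), index=["a", "b", "c"])   # string index

# iloc is always positional, so -1 means "last element" for both Series.
last_int_index = ser1.iloc[-1]
last_str_index = ser2.iloc[-1]

# loc is always label-based.
by_int_label = ser1.loc[0]
by_str_label = ser2.loc["c"]

print(last_int_index, last_str_index, by_int_label, by_str_label)
```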

In [None]:
df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list("bc"),index=["Mumbai", "Delhi"])
df2 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list("ac"),index=["Lucknow", "Delhi"])

In [None]:
df1

          b    c
Mumbai  0.0  1.0
Delhi   2.0  3.0


In [None]:
df2

           a    c
Lucknow  0.0  1.0
Delhi    2.0  3.0


In [None]:
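#Adding DataFrames aligns on both row and column labels. Any position that is not present in both frames is NaN;
#here only row 'Delhi' and column 'c' are shared, so that is the only non-null value
df1+df2

          a   b    c
Delhi   NaN NaN  6.0
Lucknow NaN NaN  NaN
Mumbai  NaN NaN  NaN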


#### **Arithmetic methods**

* add, radd -->Methods for addition (+)
* sub, rsub -->Methods for subtraction (-)
* div, rdiv -->Methods for division (/)
* floordiv, rfloordiv -->Methods for floor division (//)
* mul, rmul -->Methods for multiplication (*)
* pow, rpow -->Methods for exponentiation (**)
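
A minimal sketch of two of the methods listed above, using the same `df1` and `df2`. The `fill_value=0` argument is an illustrative choice: it stands in for a value that is missing from one frame, while positions missing from both frames remain NaN. The `r`-prefixed methods simply swap the operands.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list("bc"),
                   index=["Mumbai", "Delhi"])
df2 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list("ac"),
                   index=["Lucknow", "Delhi"])

# Like df1 + df2, but a value missing from only one frame is treated as 0.
filled = df1.add(df2, fill_value=0)

# rdiv swaps the operands: df1.rdiv(1) computes 1 / df1.
reciprocal = df1.rdiv(1)

print(filled)
print(reciprocal)
```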


NumPy ufuncs (element-wise array methods) also work with pandas objects.

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),columns=list("bde"),index=["Utah", "Ohio", "Texas", "Oregon"])


In [None]:
np.abs(frame)

               b         d         e
Utah    1.513038  0.857794  0.744028
Ohio    1.033011  0.427139  1.701701
Texas   0.041444  1.333047  0.995152
Oregon  1.421775  1.371236  0.639282


#### **Function application**

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this.

In [None]:
def fun(x):
  return x.max()-x.min()

In [None]:
frame.apply(fun)

b    2.934813
d    2.229031
e    2.696854
dtype: float64

In [None]:
frame.apply(fun,axis='columns')

Utah      2.257066
Ohio      2.734713
Texas     1.291603
Oregon    2.061057
dtype: float64
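
The function passed to `apply` does not have to return a scalar. As a sketch (the helper name `min_max` and the fixed random seed are assumptions made for the example), returning a Series produces one output row per returned label:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def min_max(x):
    # x is one column of frame; return two labeled values for it
    return pd.Series([x.min(), x.max()], index=["min", "max"])

summary = frame.apply(min_max)
print(summary)
```

The result is a DataFrame with the rows "min" and "max" and the same columns as `frame`.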

#### Sorting and Ranking

In [None]:
data=pd.Series(np.arange(4), index=["d", "a", "b", "c"])
data

d    0
a    1
b    2
c    3
dtype: int64

In [None]:
# Sort by index
data.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [None]:
# Sort by index on any axis
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),index=["three", "one"],columns=["d", "a", "b", "c"])
frame

       d  a  b  c
three  0  1  2  3
one    4  5  6  7


In [None]:
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [None]:
frame.sort_index(axis=1)

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [None]:
#The data is sorted in ascending order by default but can be sorted in descending order, too
frame.sort_index(axis='columns',ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


In [None]:
##To sort a Series by its values, use its sort_values method:
ser_data=pd.Series([4, 7, -3, 2])

In [None]:
ser_data.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [None]:
#Any missing values are sorted to the end of the Series by default:
ser_data=pd.Series([np.nan, 7, 3, np.nan])
ser_data

0    NaN
1    7.0
2    3.0
3    NaN
dtype: float64

In [None]:
ser_data.sort_values()

2    3.0
1    7.0
0    NaN
3    NaN
dtype: float64

In [None]:
# Missing values can be sorted to the start instead by using the na_position option
ser_data.sort_values(na_position='first')

0    NaN
3    NaN
2    3.0
1    7.0
dtype: float64

In [None]:
ser_data.sort_values(na_position='last')

2    3.0
1    7.0
0    NaN
3    NaN
dtype: float64

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1


In [None]:
frame.sort_values('b')

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1


In [None]:
frame.sort_values('a')

   b  a
0  4  0
2 -3  0
1  7  1
3  2  1


In [None]:
#To sort by multiple columns, pass a list of names
frame.sort_values(['a','b'])

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
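
When sorting by several columns, each column can have its own direction. A small sketch on the same frame, sorting `a` ascending and `b` descending (the choice of directions is arbitrary):

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# One ascending flag per column, in the same order as the column list.
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```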


In [None]:
data=pd.Series([7, -5, 7, 4, 2, 0, 4])
data

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [None]:
#Ranking assigns ranks from one through the number of valid data points, starting from the lowest value
#Conceptually, the values are sorted and each one receives its position in that sorted order
#By default, equal values (ties) each receive the average of the ranks they span
data.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [None]:
#Ranks can also be assigned according to the order in which they’re observed in the data
data.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [None]:
#rank in descending order
data.rank(ascending=False)

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

In [None]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],"c": [-2, 5, 8, -2.5]})
frame

     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5


In [None]:
frame.rank(axis='columns')

     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


Tie-breaking methods available through the method argument of rank:

* **"average"** Default: assign the average rank to each entry in the equal group.
* **"min"** Use the minimum rank for the whole group.
* **"max"** Use the maximum rank for the whole group.
* **"first"** Assign ranks in the order the values appear in the data.
* **"dense"** Like method="min", but ranks always increase by 1 between groups rather than the number of equal elements in a group.

In [None]:
frame.rank(axis='columns',method='first')

     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


In [None]:
df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})

In [None]:
df

    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN


In [None]:
df['Default Method']=df['Number_legs'].rank()
print(df)

    Animal  Number_legs  Default Method
0      cat          4.0             2.5
1  penguin          2.0             1.0
2      dog          4.0             2.5
3   spider          8.0             4.0
4    snake          NaN             NaN


In [None]:
df['min']=df['Number_legs'].rank(method='min')
df

    Animal  Number_legs  Default Method  min
0      cat          4.0             2.5  2.0
1  penguin          2.0             1.0  1.0
2      dog          4.0             2.5  2.0
3   spider          8.0             4.0  4.0
4    snake          NaN             NaN  NaN


In [None]:
df['max']=df['Number_legs'].rank(method='max')
df

    Animal  Number_legs  Default Method  min  max
0      cat          4.0             2.5  2.0  3.0
1  penguin          2.0             1.0  1.0  1.0
2      dog          4.0             2.5  2.0  3.0
3   spider          8.0             4.0  4.0  4.0
4    snake          NaN             NaN  NaN  NaN


In [None]:
df['first']=df['Number_legs'].rank(method='first')
df

    Animal  Number_legs  Default Method  min  max  first
0      cat          4.0             2.5  2.0  3.0    2.0
1  penguin          2.0             1.0  1.0  1.0    1.0
2      dog          4.0             2.5  2.0  3.0    3.0
3   spider          8.0             4.0  4.0  4.0    4.0
4    snake          NaN             NaN  NaN  NaN    NaN


In [None]:
df['dense']=df['Number_legs'].rank(method='dense')
df

    Animal  Number_legs  Default Method  min  max  first  dense
0      cat          4.0             2.5  2.0  3.0    2.0    2.0
1  penguin          2.0             1.0  1.0  1.0    1.0    1.0
2      dog          4.0             2.5  2.0  3.0    3.0    2.0
3   spider          8.0             4.0  4.0  4.0    4.0    3.0
4    snake          NaN             NaN  NaN  NaN    NaN    NaN


In [None]:
import pandas as pd
import numpy as np
data=pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
data

In [None]:
#The is_unique property of the index can tell you whether or not its labels are unique
data.index.is_unique

False

In [None]:
data['a']

a    0
a    1
dtype: int64
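
Because a duplicated label returns a Series while a unique label returns a scalar, code that indexes such data has to handle both shapes. The sketch below shows that difference and two possible ways of collapsing the duplicates; which one is appropriate depends on the data, and the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

multi = data["a"]    # label occurs twice  -> a Series with two entries
single = data["c"]   # label occurs once   -> a scalar

# Option 1: aggregate the values that share a label.
totals = data.groupby(level=0).sum()

# Option 2: keep only the first occurrence of each label.
first_only = data[~data.index.duplicated(keep="first")]

print(totals)
print(first_only)
```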

In [None]:
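#Keep only the two original columns, dropping the rank columns added above
df=df[['Animal','Number_legs']]
#Calling DataFrame’s sum method returns a Series containing column sums.
#Numeric columns are added; for an object (string) column, sum concatenates the values.
df.sum()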

Animal         catpenguindogspidersnake
Number_legs                        18.0
dtype: object

In [None]:
#numeric_only=True restricts the row sums to numeric columns, so the string column Animal is ignored
df.sum(axis='columns',numeric_only=True)

0    4.0
1    2.0
2    4.0
3    8.0
4    0.0
dtype: float64

In [None]:
df.sum(axis="index", skipna=False)

Animal         catpenguindogspidersnake
Number_legs                         NaN
dtype: object

In [None]:
#Sum based on index 
df.sum(axis="index", skipna=True)

Animal         catpenguindogspidersnake
Number_legs                        18.0
dtype: object

In [None]:
#numeric_only=True restricts the row means to numeric columns
df.mean(axis="columns",numeric_only=True)

0    4.0
1    2.0
2    4.0
3    8.0
4    NaN
dtype: float64

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [None]:
df.idxmax()

one    b
two    d
dtype: object

In [None]:
df.idxmin()

one    d
two    b
dtype: object

#### Descriptive and summary statistics.



In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

data=pd.DataFrame(data)

In [None]:
data

   calories  duration
0       420        50
1       380        40
2       390        45


#### **Summary**

* Count-->Number of non-NA values
* Mean-->Mean of values
* Std-->Sample standard deviation of values
* Min-->Compute minimum
* Max-->Compute maximum
* quantile-->Compute sample quantile ranging from 0 to 1

In [None]:
data.describe()

         calories  duration
count    3.000000       3.0
mean   396.666667      45.0
std     20.816660       5.0
min    380.000000      40.0
25%    385.000000      42.5
50%    390.000000      45.0
75%    405.000000      47.5
max    420.000000      50.0
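
The statistics listed above apply to numeric data. For non-numeric data `describe` returns a different set of summary entries; a minimal sketch with a made-up Series of letters:

```python
import pandas as pd

# Hypothetical categorical-style data: "a" occurs 8 times, "b" and "c" 4 times each.
obj = pd.Series(["a", "a", "b", "c"] * 4)

desc = obj.describe()
print(desc)
```

Here the entries are `count` (non-missing values), `unique` (number of distinct values), `top` (the most frequent value), and `freq` (how often it occurs).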


#### Unique Values, Value Counts, and Membership
 

In [None]:
data=pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
data

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [None]:
#gives you an array of the unique values in a Series:
data.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [None]:
#value_counts computes a Series containing value frequencies
data.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [None]:
#isin performs a vectorized set membership check and can be useful in filtering a
#dataset down to a subset of values in a Series or column in a DataFrame
data.isin(['c','a'])

0     True
1     True
2    False
3     True
4     True
5    False
6    False
7     True
8     True
dtype: bool

In [None]:
data[data.isin(['c','a'])]

0    c
1    a
3    a
4    a
7    c
8    c
dtype: object

In [None]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],"Qu2": [2, 3, 1, 2, 3],"Qu3": [1, 5, 2, 4, 4]})
data


   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


In [None]:
data['Qu1'].value_counts()

3    2
4    2
1    1
Name: Qu1, dtype: int64

In [None]:
data['Qu1'].value_counts().sort_index()

1    1
3    2
4    2
Name: Qu1, dtype: int64
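
Two variations that often come up, sketched on the same frame: `normalize=True` returns relative frequencies instead of counts, and passing `pd.Series.value_counts` to `apply` counts every column at once. The `fillna(0)` step is a choice made here so that values absent from a column show as 0 rather than NaN.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Share of each value in Qu1 (the shares sum to 1).
shares = data["Qu1"].value_counts(normalize=True)

# Rows are the distinct values found in any column; cells are per-column counts.
counts = data.apply(pd.Series.value_counts).fillna(0)

print(shares)
print(counts)
```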

#### **Text and binary data loading functions in pandas**

1. **read_csv:** Load delimited data from a file, URL, or file-like object; use comma as default delimiter
2. **read_fwf:** Read data in fixed-width column format (i.e., no delimiters)
3. **read_clipboard:** Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages
4. **read_excel:** Read tabular data from an Excel XLS or XLSX file
5. **read_hdf:** Read HDF5 files written by pandas

6. **read_html:** Read all tables found in the given HTML document
7. **read_json:** Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object

8. **read_feather:** Read the Feather binary file format
9. **read_orc:** Read the Apache ORC binary file format
10. **read_parquet:** Read the Apache Parquet binary file format
11. **read_pickle:** Read an object stored by pandas using the Python pickle format
12. **read_sas:** Read a SAS dataset stored in one of the SAS system’s custom storage formats
13. **read_spss:** Read a data file created by SPSS
14. **read_sql:** Read the results of a SQL query (using SQLAlchemy)
15. **read_sql_table:** Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table using read_sql
16. **read_stata:** Read a dataset from Stata file format
17. **read_xml:** Read a table of data from an XML file

In [None]:
#Import the DataSet from CSV file
data=pd.read_csv('/content/Advertising.csv')

In [None]:
type(data)

pandas.core.frame.DataFrame

In [None]:
#the head method selects only the first five rows by default:
data.head()
#data.head(6)

      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9


In [None]:
#the tail method selects only the last five rows:
data.tail()
#data.tail(6)

        TV  radio  newspaper  sales
195   38.2    3.7       13.8    7.6
196   94.2    4.9        8.1    9.7
197  177.0    9.3        6.4   12.8
198  283.6   42.0       66.2   25.5
199  232.1    8.6        8.7   13.4


In [None]:
#Rows can also be retrieved by position or name with the special iloc and loc attributes
data.loc[1]

TV           44.5
radio        39.3
newspaper    45.1
sales        10.4
Name: 1, dtype: float64

In [None]:
data.iloc[1]

TV           44.5
radio        39.3
newspaper    45.1
sales        10.4
Name: 1, dtype: float64

In [None]:
#The del keyword can then be used to remove this column
del data['sales']

In [None]:
data.columns

Index(['TV', 'radio', 'newspaper'], dtype='object')

In [None]:
# you can skip the first, third, and fourth rows of a file with skiprows
data=pd.read_csv('/content/Advertising.csv',names=['TV', 'radio', 'newspaper','sales'],skiprows=[0,2,3])
data

        TV  radio  newspaper  sales
0    230.1   37.8       69.2   22.1
1    151.5   41.3       58.5   18.5
2    180.8   10.8       58.4   12.9
3      8.7   48.9       75.0    7.2
4     57.5   32.8       23.5   11.8
..     ...    ...        ...    ...
193   38.2    3.7       13.8    7.6
194   94.2    4.9        8.1    9.7
195  177.0    9.3        6.4   12.8
196  283.6   42.0       66.2   25.5


In [None]:
pd.isna(data)

        TV  radio  newspaper  sales
0    False  False      False  False
1    False  False      False  False
2    False  False      False  False
3    False  False      False  False
4    False  False      False  False
..     ...    ...        ...    ...
193  False  False      False  False
194  False  False      False  False
195  False  False      False  False
196  False  False      False  False


In [None]:
#Before looking at a large file, make the pandas display settings more compact:
pd.options.display.max_rows = 30
pd.options.display.max_columns = 30
pd.options.display.max_colwidth = 30

In [None]:
#To read only a small number of rows, specify that with nrows
pd.read_csv('/content/Advertising.csv', nrows=10)

      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9
5    8.7   48.9       75.0    7.2
6   57.5   32.8       23.5   11.8
7  120.2   19.6       11.6   13.2
8    8.6    2.1        1.0    4.8
9  199.8    2.6       21.2   10.6


In [None]:
#To read a file in pieces, specify a chunksize as a number of rows:
chunk = pd.read_csv('/content/Advertising.csv', chunksize=10)
type(chunk)

pandas.io.parsers.readers.TextFileReader
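
The TextFileReader returned above is an iterator that yields one DataFrame per chunk. A minimal sketch of consuming it, using a small in-memory CSV (the `csv_text` content is invented so the example runs without the Advertising file):

```python
import io
import pandas as pd

# Hypothetical CSV with 25 data rows, standing in for a large file on disk.
csv_text = "TV,radio\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

total_rows = 0
tv_sum = 0
for piece in reader:              # each piece is an ordinary DataFrame
    total_rows += len(piece)      # chunks hold 10, 10, and 5 rows
    tv_sum += piece["TV"].sum()   # aggregate incrementally, chunk by chunk

print(total_rows, tv_sum)
```

Only one chunk is held in memory at a time, which is the point of reading in pieces.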

#### Writing Data to Text Format

In [None]:
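#to_csv writes the DataFrame to a comma-separated file; it returns None when a path is given, so there is nothing to assign
data.to_csv('/content/data_copy.csv')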
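
A few `to_csv` options are worth knowing. The sketch below writes to an in-memory buffer instead of a file so the result can be inspected directly; the frame `df_out`, the `|` delimiter, and the `NULL` marker are all illustrative choices:

```python
import io
import numpy as np
import pandas as pd

df_out = pd.DataFrame({"a": [1, 2], "b": [np.nan, 4.0]})

buffer = io.StringIO()
# sep changes the delimiter, na_rep sets the text written for missing values,
# and index=False leaves the row labels out of the output.
df_out.to_csv(buffer, sep="|", na_rep="NULL", index=False)

csv_text = buffer.getvalue()
print(csv_text)
```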

In [8]:
data=[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]


In [2]:
type(data)

list

In [5]:
import json
import pandas as pd
import numpy as np

In [11]:
#json.dumps converts a Python object to a JSON string
a=json.dumps(data)
print(a)

[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]


In [14]:
# To convert a JSON string to Python form, use json.loads
json.loads(a)

[{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6}, {'a': 7, 'b': 8, 'c': 9}]

In [15]:
with open('demo.json','w') as f:
  f.write('[{"a": 1, "b": 2, "c": 3},{"a": 4, "b": 5, "c": 6},{"a": 7, "b": 8, "c": 9}]')


In [18]:
#Load json data
df_json=pd.read_json('/content/demo.json')
print(df_json)

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9


In [21]:
#If you need to export data from pandas to JSON, one way is to use the to_json methods on Series and DataFrame
import sys
df_json.to_json(sys.stdout)  #stdout is used to display output directly to the screen console

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
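
The default output above is organized by column. If a list of row objects is needed instead, matching the layout of the original JSON input, `orient="records"` is one option. A small sketch that rebuilds the same frame in place of reading `demo.json`:

```python
import json
import pandas as pd

df_json = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# With no path argument, to_json returns the JSON text as a string.
records = df_json.to_json(orient="records")

# Parsing it back gives one dict per row.
round_trip = json.loads(records)
print(records)
```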

In [24]:
#xlrd reads legacy .xls files; .xlsx files are read with openpyxl
!pip install xlrd openpyxl


In [28]:
#To use pandas.ExcelFile, create an instance by passing a path to an xls or xlsx file
xls_demo=pd.ExcelFile('/content/demo.xlsx')

In [29]:
type(xls_demo)

pandas.io.excel._base.ExcelFile

In [30]:
#This object can show you the list of available sheet names in the file:
xls_demo.sheet_names

['Sheet1']

In [31]:
#Data stored in a sheet can then be read into DataFrame with parse:
xls_demo.parse(sheet_name="Sheet1")

  Name  Age  Salary
0  Nit   29   25000
1  Sam   28   22000
2  Ram   30   30000


In [32]:
#If you are reading multiple sheets in a file, then it is faster to create the pandas.ExcelFile,
#but you can also simply pass the filename to pandas.read_excel
frame = pd.read_excel("/content/demo.xlsx", sheet_name="Sheet1")

In [33]:
frame

  Name  Age  Salary
0  Nit   29   25000
1  Sam   28   22000
2  Ram   30   30000
