
# Pandas Official Documentation

In [6]:
import pandas as pd

In [7]:
import numpy as np

In [8]:
pd.DataFrame({'A':[1,2,3]})

Unnamed: 0,A
0,1
1,2
2,3


# Basic data structures in pandas

Pandas provides two types of classes for handling data:

1. Series: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects, etc.

2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

In [9]:
s = pd.Series([10,9,8,7,6])
print(s)

0    10
1     9
2     8
3     7
4     6
dtype: int64


In [10]:
s = pd.Series([10,9,8,7,6, np.nan])
print(s)

0    10.0
1     9.0
2     8.0
3     7.0
4     6.0
5     NaN
dtype: float64


# Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [11]:
# s = pd.Series(data, index=index)

Here, data can be many different things:

1. a Python dict

2. an ndarray

3. a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

# From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].





In [12]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(s)

a    1.055174
b   -0.677917
c    0.563005
d    2.328262
e   -1.217239
dtype: float64


In [13]:
s = pd.Series(np.random.randn(5))
print(s)

0    2.528049
1    1.109512
2   -2.137868
3    0.347307
4   -0.086190
dtype: float64


# From dict

Series can be instantiated from dicts:



In [14]:
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.



In [15]:
d = {"b": 1, "a": 0, "c": 2}
# Labels that are missing from the dict ("d" here) are filled with NaN
pd.Series(d, index=["b", "c", "d", "a"])




b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

# From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [16]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

# Series is ndarray-like

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.


In [17]:
s = pd.Series([10,9,8,7,6])
s.iloc[0]

10

In [18]:
s[1]

9

In [19]:
s.iloc[0]
s[1]  # only the last expression in a cell is displayed, so the output is 9

9

In [20]:
s[:3]

0    10
1     9
2     8
dtype: int64

In [21]:
s.iloc[:3]

0    10
1     9
2     8
dtype: int64

In [22]:
s[s > s.median()]

0    10
1     9
dtype: int64

In [23]:
s.iloc[[4, 3, 1]]

4    6
3    7
1    9
dtype: int64

In [24]:
np.exp(s)

0    22026.465795
1     8103.083928
2     2980.957987
3     1096.633158
4      403.428793
dtype: float64

In this example, s[1] and s.iloc[1] both return 9, the second element of s, but they get there differently.

1. s[1]: Plain square brackets with an integer look the value up by index label. Because s has the default index 0, 1, 2, ..., the label 1 happens to sit at the second position, so the result is 9. If the index held different integer labels, s[1] would return whichever element is labeled 1, wherever it sits.

2. s.iloc[1]: .iloc is an indexer (short for "integer location") that always selects by position, counting from 0, regardless of the index labels. Here it also returns 9.

In summary, the two agree only because the default labels coincide with the positions. Use .iloc when you mean a position and .loc (or plain brackets) when you mean a label.
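To make the label-versus-position distinction concrete, here is a minimal sketch. The Series `t` and its reversed integer index are illustrative and not part of the original notebook.

```python
import pandas as pd

# Integer labels that deliberately do not match the positions (illustrative data).
t = pd.Series([10, 9, 8], index=[2, 1, 0])

by_label = t.loc[1]      # the element labeled 1
by_position = t.iloc[0]  # the first element, whatever its label
label_zero = t.loc[0]    # label 0 sits in the last position

print(by_label, by_position, label_zero)
```

With this index, `t.loc[0]` and `t.iloc[0]` return different elements, which is why being explicit about `.loc` and `.iloc` avoids surprises.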

Like a NumPy array, a pandas Series has a single dtype.



In [25]:
s.dtype

dtype('int64')

In [26]:
s = pd.Series([10.0,9.5,8.3,7.4,6.7])
s.dtype

dtype('float64')

If you need the actual array backing a Series, use Series.array.

In [27]:
s.array

<PandasArray>
[10.0, 9.5, 8.3, 7.4, 6.7]
Length: 5, dtype: float64

In [28]:
s.to_numpy()

array([10. ,  9.5,  8.3,  7.4,  6.7])

# Series is dict-like

A Series is also like a fixed-size dict in that you can get and set values by index label:



In [29]:
s = pd.Series([3,4,5,6,7], index=["a", "b", "c", "d", "e"])
print(s)

a    3
b    4
c    5
d    6
e    7
dtype: int64


In [30]:
s["c"]

5
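The cell above only reads a value. As a small sketch of the other dict-like operations (setting by label, membership tests, and `.get`), using an illustrative Series named `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([3, 4, 5, 6, 7], index=["a", "b", "c", "d", "e"])

s_demo["e"] = 12                # set a value by label
has_e = "e" in s_demo           # membership tests look at the index labels
has_f = "f" in s_demo
missing = s_demo.get("f")       # .get returns None instead of raising KeyError
fallback = s_demo.get("f", -1)  # or a default that you supply

print(s_demo["e"], has_e, has_f, missing, fallback)
```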

# Vectorized operations and label alignment with Series
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [31]:
s + s

a     6
b     8
c    10
d    12
e    14
dtype: int64

In [32]:
s * 2

a     6
b     8
c    10
d    12
e    14
dtype: int64

In [33]:
s[1:] + s[:-1]

a     NaN
b     8.0
c    10.0
d    12.0
e     NaN
dtype: float64

In [34]:
s.iloc[1:] + s.iloc[:-1]

a     NaN
b     8.0
c    10.0
d    12.0
e     NaN
dtype: float64
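The NaN values above come from label alignment: pandas matches elements by index label, not by position, and a label present on only one side has nothing to be added to. A minimal sketch with two illustrative Series, `x` and `y`, also shows `Series.add` with `fill_value` as one way to treat the missing side as 0:

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["c", "b", "d"])

total = x + y                    # aligned on labels; unmatched labels give NaN
filled = x.add(y, fill_value=0)  # treat the missing side as 0 instead

print(total)
print(filled)
```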

# Name attribute
Series also has a name attribute:

In [35]:
s = pd.Series([1,2,3,4,5], name="something")
print(s)

0    1
1    2
2    3
3    4
4    5
Name: something, dtype: int64


In [36]:
s2 = s.rename("different")
print(s2)
print(s)

0    1
1    2
2    3
3    4
4    5
Name: different, dtype: int64
0    1
1    2
2    3
3    4
4    5
Name: something, dtype: int64


# DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

1. Dict of 1D ndarrays, lists, dicts, or Series

2. 2-D numpy.ndarray

3. Structured or record ndarray

4. A Series

5. Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.



# From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [37]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"]),
}


df = pd.DataFrame(d)

df

# Each dict key becomes a column name. Each Series supplies that column's
# values, and its index supplies the row labels.

Unnamed: 0,one,two
a,1.0,10
b,2.0,20
c,3.0,30
d,,40


In [38]:
pd.DataFrame(d, index=["d", "b", "a"])

Unnamed: 0,one,two
d,,40
b,2.0,20
a,1.0,10


In [39]:
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

Unnamed: 0,two,three
d,40,
b,20,
a,10,


In [40]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [41]:
df.columns

Index(['one', 'two'], dtype='object')

# From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [42]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [43]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d, index = ["a", "b", "c", "d"])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0
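As a hedged illustration of the same-length requirement, the sketch below passes lists of different lengths and catches the resulting error. The variable name `mismatch_error` is only for this example.

```python
import pandas as pd

try:
    # "one" has three values but "two" has only two.
    pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__, mismatch_error)
```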


# From structured or record array

This case is handled identically to a dict of arrays.

In [44]:
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])  # "S10": 10-byte string field
data

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [45]:
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
pd.DataFrame(data)

Unnamed: 0,A,B,C
0,0,0.0,b''
1,0,0.0,b''


In [46]:
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
# Slice assignment overwrites every record in place; the array keeps its shape and dtype
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]
pd.DataFrame(data)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


In [47]:
pd.DataFrame(data, index=["first", "second"])

Unnamed: 0,A,B,C
first,1,2.0,b'Hello'
second,2,3.0,b'World'


In [48]:
pd.DataFrame(data, columns=["C", "A", "B"])

Unnamed: 0,C,A,B
0,b'Hello',1,2.0
1,b'World',2,3.0


In [49]:
a = [1,2,3,4,5,6,7,8,9]
a[:]

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [50]:
a = [1,2,3,4,5,6,7,8,9]
a[:] = [2,3,4]
a

[2, 3, 4]

In [51]:
a = [1,2,3,[4,5,6,7],8,9]
a[:] = [2,3,[4]]
a

[2, 3, [4]]

# From a list of dicts

In [52]:
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [53]:
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2, index = ["A", "B"])

Unnamed: 0,a,b,c
A,1,2,
B,5,10,20.0


In [54]:
pd.DataFrame(data2, columns=["a", "b"])

Unnamed: 0,a,b
0,1,2
1,5,10


In [55]:
pd.DataFrame(data2, index=["first", "second"])


Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


# From a dict of tuples

In [56]:
pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,b,a,c,a,b
A,B,1.0,4.0,5.0,8.0,10.0
A,C,2.0,3.0,6.0,7.0,
A,D,,,,,9.0


# From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (used only if no other column name is provided).

In [57]:
ser = pd.Series(range(3), index=list("abc"), name="ser")

pd.DataFrame(ser)

Unnamed: 0,ser
a,0
b,1
c,2


In [58]:
ser = pd.Series(range(7), index=list("abcdefg"), name="ser")
pd.DataFrame(ser)

Unnamed: 0,ser
a,0
b,1
c,2
d,3
e,4
f,5
g,6
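A related route, sketched here as an alternative rather than as what the notebook uses, is `Series.to_frame`, which lets you keep the Series name as the column name or supply a different one. The names `ser_demo` and `"value"` are illustrative.

```python
import pandas as pd

ser_demo = pd.Series(range(3), index=list("abc"), name="ser")

default_frame = ser_demo.to_frame()              # column is named "ser"
renamed_frame = ser_demo.to_frame(name="value")  # column is named "value"

print(default_frame.columns.tolist(), renamed_frame.columns.tolist())
```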


# From a list of namedtuples

The field names of the first namedtuple in the list determine the columns of the DataFrame. The remaining namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the DataFrame. If any of those tuples is shorter than the first namedtuple then the later columns in the corresponding row are marked as missing values. If any are longer than the first namedtuple, a ValueError is raised.

In [59]:
from collections import namedtuple

Point = namedtuple("Point", "x y")

pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])

Unnamed: 0,x,y
0,0,0
1,0,3
2,2,3


In [60]:
Point3D = namedtuple("Point3D", "x y z")

pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])

Unnamed: 0,x,y,z
0,0,0,0.0
1,0,3,5.0
2,2,3,


# Alternate constructors
# DataFrame.from_dict

DataFrame.from_dict() takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.



In [61]:
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column names:

In [62]:
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
                       orient = "index",
                       columns = ["one", "two", "three"])

Unnamed: 0,one,two,three
A,1,2,3
B,4,5,6


# Column selection, addition, deletion
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [63]:
df["one"]

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [64]:
df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2
df

Unnamed: 0,one,two,three,flag
a,1.0,10,10.0,False
b,2.0,20,40.0,False
c,3.0,30,90.0,True
d,,40,,False


Columns can be deleted or popped like with a dict:

In [65]:
del df["two"]

three = df.pop("three")

df

Unnamed: 0,one,flag
a,1.0,False
b,2.0,False
c,3.0,True
d,,False


When inserting a scalar value, it will naturally be propagated to fill the column:



In [66]:
df["foo"] = "bar"

df

Unnamed: 0,one,flag,foo
a,1.0,False,bar
b,2.0,False,bar
c,3.0,True,bar
d,,False,bar


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [67]:
df["one_trunc"] = df["one"][:2]

df

Unnamed: 0,one,flag,foo,one_trunc
a,1.0,False,bar,1.0
b,2.0,False,bar,2.0
c,3.0,True,bar,
d,,False,bar,


You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
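A minimal sketch of that length rule, using an illustrative four-row DataFrame named `frame`: an array of matching length is accepted, while a shorter one raises a ValueError.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=list("abcd"))

frame["squares"] = np.arange(4) ** 2  # length 4 matches the 4-row index

try:
    frame["too_short"] = np.array([1, 2])  # length 2 does not match
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(type(length_error).__name__)
```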

By default, columns get inserted at the end. DataFrame.insert() inserts at a particular location in the columns:

In [68]:
df.insert(1, "bar", df["one"])

df

Unnamed: 0,one,bar,flag,foo,one_trunc
a,1.0,1.0,False,bar,1.0
b,2.0,2.0,False,bar,2.0
c,3.0,3.0,True,bar,
d,,,False,bar,


# CodeWithHarry

In [69]:
dict1 = {
    "name" : ["Harry", 'Rohan',"Skillf","Shubh"],
    "marks" : [92,34,24,17],
    "city" : ["Rampur", "Kolkata", "Bareilly", "Antartica"]
}

In [70]:
df = pd.DataFrame(dict1)

In [71]:
df

Unnamed: 0,name,marks,city
0,Harry,92,Rampur
1,Rohan,34,Kolkata
2,Skillf,24,Bareilly
3,Shubh,17,Antartica


# DataFrame to CSV
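By default, to_csv writes the index (the row labels) as the first column of the file. Pass index = False to leave it out.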

In [72]:
df.to_csv('friends.csv')

In [73]:
df.to_csv('friends_index_false.csv', index = False)
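To see what the index argument changes, the sketch below writes to in-memory text instead of files and reads the result back. The DataFrame `friends` is an illustrative two-row stand-in. When the index is written and then read back without `index_col`, it comes back as an ordinary column that pandas names "Unnamed: 0".

```python
import io

import pandas as pd

friends = pd.DataFrame({"name": ["Harry", "Rohan"], "marks": [92, 34]})

with_index = friends.to_csv()                # no path given: returns the CSV text
without_index = friends.to_csv(index=False)

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))
reread_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)  # restore the index

print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
print(reread_indexed.columns.tolist())
```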

In [74]:
df.head(2)

Unnamed: 0,name,marks,city
0,Harry,92,Rampur
1,Rohan,34,Kolkata


In [75]:
df.tail(2)

Unnamed: 0,name,marks,city
2,Skillf,24,Bareilly
3,Shubh,17,Antartica


# Statistical Analysis of Numerical Values

In [76]:
df.describe()

Unnamed: 0,marks
count,4.0
mean,41.75
std,34.21866
min,17.0
25%,22.25
50%,29.0
75%,48.5
max,92.0
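Note that describe() summarized only the numeric marks column. As a small sketch, passing `include="all"` asks for every column, with statistics that do not apply shown as NaN. The DataFrame `people` is illustrative.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Harry", "Rohan", "Shubh"], "marks": [92, 34, 17]})

numeric_summary = people.describe()           # numeric columns only
full_summary = people.describe(include="all")  # every column

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```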


In [77]:
harry = pd.read_csv("/content/sample_data/harry.csv - Sheet1.csv")

In [78]:
harry

Unnamed: 0,Train No,Speed,City
0,12322,34,rampur
1,12534,66,kolkata
2,12654,56,bareilly
3,65470,79,Antarctica


In [79]:
harry['Speed']

0    34
1    66
2    56
3    79
Name: Speed, dtype: int64

In [80]:
harry['Speed'][0]

34

In [81]:
harry['Speed'][0] = 50

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  harry['Speed'][0] = 50
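The warning appears because `harry['Speed'][0] = 50` is two indexing steps in a row (chained assignment), and pandas cannot always guarantee that the intermediate `harry['Speed']` is a view of the original rather than a copy. A sketch of the single-step alternative with `.loc` follows; the DataFrame `trains` is an illustrative stand-in for the loaded CSV, not the actual file.

```python
import pandas as pd

# Illustrative stand-in whose values mirror part of the table displayed above.
trains = pd.DataFrame({"Train No": [12322, 12534], "Speed": [34, 66]})

trains.loc[0, "Speed"] = 50  # row label 0 and column "Speed" in one indexing step

print(trains)
```

Setting through `.loc` does not depend on view-versus-copy behavior, so it is the form to prefer for assignments like this one.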


In [82]:
harry

Unnamed: 0,Train No,Speed,City
0,12322,50,rampur
1,12534,66,kolkata
2,12654,56,bareilly
3,65470,79,Antarctica


In [83]:
harry.to_csv('harry.csv')

In [84]:
harry.index = ['first', 'second', 'third', 'fourth']

In [85]:
harry

Unnamed: 0,Train No,Speed,City
first,12322,50,rampur
second,12534,66,kolkata
third,12654,56,bareilly
fourth,65470,79,Antarctica


In [86]:
# np.random.rand is passed without being called, so the Series holds the function object itself
ser = pd.Series(np.random.rand)

In [87]:
ser

0    <built-in method rand of numpy.random.mtrand.R...
dtype: object

In [88]:
ser = pd.Series(np.random.rand(34))

In [89]:
ser

0     0.489362
1     0.065004
2     0.030768
3     0.367056
4     0.195849
5     0.474802
6     0.811549
7     0.600401
8     0.396841
9     0.949197
10    0.462437
11    0.323675
12    0.020645
13    0.821932
14    0.490837
15    0.732082
16    0.566755
17    0.778868
18    0.143618
19    0.932667
20    0.826151
21    0.370818
22    0.834765
23    0.714729
24    0.801738
25    0.121874
26    0.270177
27    0.975594
28    0.876604
29    0.357118
30    0.791072
31    0.973771
32    0.635847
33    0.150064
dtype: float64

In [90]:
type(ser)

pandas.core.series.Series

In [91]:
newdf = pd.DataFrame(np.random.rand(334,5), index = np.arange(334))

In [92]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.284361,0.916036,0.705687,0.726032,0.923238
1,0.88756,0.978838,0.096583,0.05511,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [93]:
newdf.tail()

Unnamed: 0,0,1,2,3,4
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.58031
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.29889
333,0.329516,0.697441,0.504897,0.259232,0.601372


In [94]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.284361,0.916036,0.705687,0.726032,0.923238
1,0.887560,0.978838,0.096583,0.055110,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.516220,0.252023,0.500908,0.322258,0.298890


In [95]:
type(newdf)

pandas.core.frame.DataFrame

In [96]:
newdf.describe()

Unnamed: 0,0,1,2,3,4
count,334.0,334.0,334.0,334.0,334.0
mean,0.496398,0.498413,0.48878,0.486723,0.513648
std,0.289394,0.271743,0.291972,0.286417,0.284989
min,0.009773,0.023078,0.004159,0.00254,0.004407
25%,0.237379,0.257812,0.224293,0.232717,0.279748
50%,0.515602,0.49432,0.482528,0.469902,0.494788
75%,0.735574,0.724678,0.753811,0.73443,0.768209
max,0.999482,0.996109,0.997076,0.99292,0.999865


In [97]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [98]:
newdf[0][0] = "harry"

In [99]:
newdf.dtypes

0     object
1    float64
2    float64
3    float64
4    float64
dtype: object

In [100]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,harry,0.916036,0.705687,0.726032,0.923238
1,0.88756,0.978838,0.096583,0.05511,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [101]:
newdf[0][1] = "harry"

In [102]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,harry,0.916036,0.705687,0.726032,0.923238
1,harry,0.978838,0.096583,0.05511,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [103]:
newdf[3][1] = "har"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf[3][1] = "har"


In [104]:
newdf.dtypes

0     object
1    float64
2    float64
3     object
4    float64
dtype: object

In [105]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,harry,0.916036,0.705687,0.726032,0.923238
1,harry,0.978838,0.096583,har,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [106]:
newdf.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
           dtype='int64', length=334)

In [107]:
newdf.columns

RangeIndex(start=0, stop=5, step=1)

In [108]:
newdf.to_numpy  # without parentheses this displays the bound method instead of calling it

<bound method DataFrame.to_numpy of             0         1         2         3         4
0       harry  0.916036  0.705687  0.726032  0.923238
1       harry  0.978838  0.096583       har  0.692406
2    0.903961  0.960901  0.677410  0.407416  0.840507
3    0.132554  0.657832  0.603474  0.269729  0.921647
4    0.129319  0.831966  0.814104  0.633965  0.440859
..        ...       ...       ...       ...       ...
329  0.071551  0.976249  0.936548  0.404837  0.583003
330  0.092559  0.725438  0.708244  0.413599  0.580310
331  0.424742  0.063721  0.921735  0.792344  0.404538
332   0.51622  0.252023  0.500908  0.322258  0.298890
333  0.329516  0.697441  0.504897  0.259232  0.601372

[334 rows x 5 columns]>

In [109]:
newdf.dtypes

0     object
1    float64
2    float64
3     object
4    float64
dtype: object

In [110]:
newdf[0][0] = 0.3
newdf[3][1] = 0.36
newdf[0][1] = 0.4

In [111]:
newdf.dtypes

0     object
1    float64
2    float64
3     object
4    float64
dtype: object
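The columns stay object dtype even though they now contain only numbers, because writing floats back does not change the dtype a column was widened to. A minimal sketch of two ways to convert back, using an illustrative object-dtype Series `col`:

```python
import pandas as pd

# An object-dtype column that holds only floats (illustrative data).
col = pd.Series([0.3, 0.4, 0.903961], dtype=object)

as_float = col.astype(float)     # explicit cast
as_numeric = pd.to_numeric(col)  # lets pandas pick a numeric dtype

print(col.dtype, as_float.dtype, as_numeric.dtype)
```

In the notebook this would be applied per column, for example by assigning the converted Series back to the column.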

In [112]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [113]:
newdf.to_numpy  # still the bound method; call newdf.to_numpy() to get the array

<bound method DataFrame.to_numpy of             0         1         2         3         4
0         0.3  0.916036  0.705687  0.726032  0.923238
1         0.4  0.978838  0.096583      0.36  0.692406
2    0.903961  0.960901  0.677410  0.407416  0.840507
3    0.132554  0.657832  0.603474  0.269729  0.921647
4    0.129319  0.831966  0.814104  0.633965  0.440859
..        ...       ...       ...       ...       ...
329  0.071551  0.976249  0.936548  0.404837  0.583003
330  0.092559  0.725438  0.708244  0.413599  0.580310
331  0.424742  0.063721  0.921735  0.792344  0.404538
332   0.51622  0.252023  0.500908  0.322258  0.298890
333  0.329516  0.697441  0.504897  0.259232  0.601372

[334 rows x 5 columns]>

In [114]:
newdf.to_numpy()

array([[0.3, 0.9160361226972238, 0.7056865822562279, 0.7260321497515998,
        0.9232383136476583],
       [0.4, 0.9788384338273961, 0.0965830151024829, 0.36,
        0.6924056615505919],
       [0.9039613417396604, 0.9609014720928359, 0.6774101095092363,
        0.4074163123943283, 0.8405070746412797],
       ...,
       [0.4247421694802618, 0.06372111663235835, 0.9217352674913618,
        0.7923435640845349, 0.4045380219212631],
       [0.5162200937848653, 0.2520226685322928, 0.5009078626463369,
        0.32225764644747423, 0.298889768179626],
       [0.3295157245115198, 0.6974409855584507, 0.5048965961445985,
        0.2592320346242294, 0.601371695576423]], dtype=object)

In [115]:
newdf.dtypes

0     object
1    float64
2    float64
3     object
4    float64
dtype: object

In [116]:
newdf.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,324,325,326,327,328,329,330,331,332,333
0,0.3,0.4,0.903961,0.132554,0.129319,0.27971,0.813585,0.210953,0.681265,0.893782,...,0.018412,0.721496,0.197583,0.280318,0.076378,0.071551,0.092559,0.424742,0.51622,0.329516
1,0.916036,0.978838,0.960901,0.657832,0.831966,0.759544,0.961278,0.960132,0.507173,0.893394,...,0.578004,0.948399,0.436003,0.294766,0.126304,0.976249,0.725438,0.063721,0.252023,0.697441
2,0.705687,0.096583,0.67741,0.603474,0.814104,0.765119,0.120109,0.037143,0.049549,0.763335,...,0.257184,0.851831,0.617745,0.935332,0.985902,0.936548,0.708244,0.921735,0.500908,0.504897
3,0.726032,0.36,0.407416,0.269729,0.633965,0.223099,0.790557,0.076562,0.17889,0.962357,...,0.129891,0.456179,0.838113,0.750128,0.172089,0.404837,0.413599,0.792344,0.322258,0.259232
4,0.923238,0.692406,0.840507,0.921647,0.440859,0.510042,0.351245,0.943652,0.807237,0.283149,...,0.235852,0.942578,0.962528,0.314052,0.637891,0.583003,0.58031,0.404538,0.29889,0.601372


In [117]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [119]:
newdf.sort_index(axis = 0)
# axis=0 sorts by the row labels (the index)
# axis=1 sorts by the column labels
# ascending=True is the default

Unnamed: 0,0,1,2,3,4
0,0.3,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [120]:
newdf.sort_index(axis = 0, ascending = False)

Unnamed: 0,0,1,2,3,4
333,0.329516,0.697441,0.504897,0.259232,0.601372
332,0.51622,0.252023,0.500908,0.322258,0.298890
331,0.424742,0.063721,0.921735,0.792344,0.404538
330,0.092559,0.725438,0.708244,0.413599,0.580310
329,0.071551,0.976249,0.936548,0.404837,0.583003
...,...,...,...,...,...
4,0.129319,0.831966,0.814104,0.633965,0.440859
3,0.132554,0.657832,0.603474,0.269729,0.921647
2,0.903961,0.960901,0.677410,0.407416,0.840507
1,0.4,0.978838,0.096583,0.36,0.692406


In [121]:
newdf.sort_index(axis = 1, ascending = False)

Unnamed: 0,4,3,2,1,0
0,0.923238,0.726032,0.705687,0.916036,0.3
1,0.692406,0.36,0.096583,0.978838,0.4
2,0.840507,0.407416,0.677410,0.960901,0.903961
3,0.921647,0.269729,0.603474,0.657832,0.132554
4,0.440859,0.633965,0.814104,0.831966,0.129319
...,...,...,...,...,...
329,0.583003,0.404837,0.936548,0.976249,0.071551
330,0.580310,0.413599,0.708244,0.725438,0.092559
331,0.404538,0.792344,0.921735,0.063721,0.424742
332,0.298890,0.322258,0.500908,0.252023,0.51622
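Because newdf is already in index order, the effect of sort_index is easier to see on a small frame with shuffled labels. The DataFrame `grid` below is illustrative. Note that sort_index returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

grid = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=[2, 0, 1],
    columns=["b", "a"],
)

rows_sorted = grid.sort_index(axis=0)                 # order rows by index label
cols_sorted = grid.sort_index(axis=1)                 # order columns by column label
rows_desc = grid.sort_index(axis=0, ascending=False)  # descending row labels

print(rows_sorted)
print(cols_sorted)
print(rows_desc)
```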


In [122]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [123]:
newdf[0]

0           0.3
1           0.4
2      0.903961
3      0.132554
4      0.129319
         ...   
329    0.071551
330    0.092559
331    0.424742
332     0.51622
333    0.329516
Name: 0, Length: 334, dtype: object

In [124]:
type(newdf[0])

pandas.core.series.Series

In [125]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [126]:
newdf2 = newdf

In [127]:
newdf2[0][0] = 897

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf2[0][0] = 897


In [128]:
newdf2

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [129]:
newdf

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,0.4,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


# newdf2 is another name for the same DataFrame as newdf

# so a change made through newdf2 shows up in newdf too

In [130]:
# Slicing vs. copying

newdf2 = newdf[:]  # a new DataFrame object, but it may still share data with newdf
# Use newdf2 = newdf.copy() for an independent copy


In [131]:
newdf2[0][1] = 3850305

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf2[0][1] = 3850305


In [132]:
newdf2

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [133]:
newdf

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [134]:
newdf2 = newdf.copy()

In [135]:
newdf2[0][2] = 2434

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf2[0][2] = 2434


In [136]:
newdf2

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,2434,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [137]:
newdf

Unnamed: 0,0,1,2,3,4
0,897,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890
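The experiment above can be condensed into a short sketch that separates the two cases: plain assignment creates a second name for the same object, while `.copy()` creates an independent DataFrame. The names `original`, `alias`, and `independent` are illustrative, and `.loc` is used for the writes to avoid chained assignment.

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate DataFrame with its own data

alias.loc[0, "A"] = 897.0         # visible through both names
independent.loc[1, "A"] = 2434.0  # stays in the copy only

print(alias is original, independent is original)
print(original)
print(independent)
```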


In [138]:
newdf.loc[0,0] = 654

In [140]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,654.0,0.916036,0.705687,0.726032,0.923238
1,3850305.0,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [141]:
newdf.columns = list('ABCDE')

In [142]:
newdf

Unnamed: 0,A,B,C,D,E
0,654,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [143]:
# The columns are now 'A'..'E', so label 0 no longer exists and .loc adds a new column named 0
newdf.loc[0,0] = 654

In [144]:
newdf

Unnamed: 0,A,B,C,D,E,0
0,654,0.916036,0.705687,0.726032,0.923238,654.0
1,3850305,0.978838,0.096583,0.36,0.692406,
2,0.903961,0.960901,0.677410,0.407416,0.840507,
3,0.132554,0.657832,0.603474,0.269729,0.921647,
4,0.129319,0.831966,0.814104,0.633965,0.440859,
...,...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003,
330,0.092559,0.725438,0.708244,0.413599,0.580310,
331,0.424742,0.063721,0.921735,0.792344,0.404538,
332,0.51622,0.252023,0.500908,0.322258,0.298890,


In [145]:
newdf.loc[0, 'A'] = 669

In [146]:
newdf

Unnamed: 0,A,B,C,D,E,0
0,669,0.916036,0.705687,0.726032,0.923238,654.0
1,3850305,0.978838,0.096583,0.36,0.692406,
2,0.903961,0.960901,0.677410,0.407416,0.840507,
3,0.132554,0.657832,0.603474,0.269729,0.921647,
4,0.129319,0.831966,0.814104,0.633965,0.440859,
...,...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003,
330,0.092559,0.725438,0.708244,0.413599,0.580310,
331,0.424742,0.063721,0.921735,0.792344,0.404538,
332,0.51622,0.252023,0.500908,0.322258,0.298890,


In [148]:
newdf = newdf.drop(0, axis = 1)

In [149]:
newdf

Unnamed: 0,A,B,C,D,E
0,669,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [150]:
newdf.loc[[1,2], ['C', 'D']]
# Returns a new DataFrame with rows 1 and 2 and columns 'C' and 'D', selected by label; newdf is unchanged

Unnamed: 0,C,D
1,0.096583,0.36
2,0.67741,0.407416


In [151]:
newdf

Unnamed: 0,A,B,C,D,E
0,669,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [152]:
newdf.loc[:, ['C', 'D']]

Unnamed: 0,C,D
0,0.705687,0.726032
1,0.096583,0.36
2,0.677410,0.407416
3,0.603474,0.269729
4,0.814104,0.633965
...,...,...
329,0.936548,0.404837
330,0.708244,0.413599
331,0.921735,0.792344
332,0.500908,0.322258


In [153]:
newdf.loc[[1,2], :]

Unnamed: 0,A,B,C,D,E
1,3850305.0,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507


In [154]:
newdf.loc[(newdf['A'] < 0.3)]

Unnamed: 0,A,B,C,D,E
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
5,0.27971,0.759544,0.765119,0.223099,0.510042
7,0.210953,0.960132,0.037143,0.076562,0.943652
12,0.083382,0.522748,0.739602,0.35295,0.827732
...,...,...,...,...,...
326,0.197583,0.436003,0.617745,0.838113,0.962528
327,0.280318,0.294766,0.935332,0.750128,0.314052
328,0.076378,0.126304,0.985902,0.172089,0.637891
329,0.071551,0.976249,0.936548,0.404837,0.583003


In [156]:
newdf.loc[(newdf['A'] < 0.3) & (newdf['C'] > 0.2)]

Unnamed: 0,A,B,C,D,E
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
5,0.27971,0.759544,0.765119,0.223099,0.510042
12,0.083382,0.522748,0.739602,0.35295,0.827732
14,0.022717,0.941756,0.202554,0.033485,0.454304
...,...,...,...,...,...
326,0.197583,0.436003,0.617745,0.838113,0.962528
327,0.280318,0.294766,0.935332,0.750128,0.314052
328,0.076378,0.126304,0.985902,0.172089,0.637891
329,0.071551,0.976249,0.936548,0.404837,0.583003
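
Boolean conditions inside `.loc` are combined element-wise with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on hypothetical data (the frame `demo` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical data, not the notebook's newdf
demo = pd.DataFrame({"A": [0.1, 0.5, 0.2, 0.9],
                     "C": [0.3, 0.1, 0.05, 0.8]})

both = demo.loc[(demo["A"] < 0.3) & (demo["C"] > 0.2)]     # both conditions hold
either = demo.loc[(demo["A"] < 0.3) | (demo["C"] > 0.2)]   # at least one holds
not_small = demo.loc[~(demo["A"] < 0.3)]                   # first condition is false

print(both.index.tolist())       # [0]
print(either.index.tolist())     # [0, 2, 3]
print(not_small.index.tolist())  # [1, 3]
```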


In [157]:
newdf.head(2)

Unnamed: 0,A,B,C,D,E
0,669,0.916036,0.705687,0.726032,0.923238
1,3850305,0.978838,0.096583,0.36,0.692406


In [158]:
newdf.iloc[0,4]
# iloc selects by integer position (here row 0, column 4), regardless of the row or column labels

0.9232383136476583

In [159]:
newdf.iloc[[0,5], [1,2]]

Unnamed: 0,B,C
0,0.916036,0.705687
5,0.759544,0.765119
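
The difference between `.loc` and `.iloc` is easiest to see when the row labels do not start at 0. The sketch below uses a hypothetical frame `demo` whose index is `[5, 6, 7]`; it also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
demo = pd.DataFrame({"B": [10, 20, 30], "C": [1, 2, 3]},
                    index=[5, 6, 7])

by_label = demo.loc[5, "B"]      # row labelled 5, column labelled "B"
by_position = demo.iloc[0, 0]    # first row, first column
print(by_label, by_position)     # both refer to the same cell: 10 10

label_slice = demo.loc[5:7]      # labels 5, 6 and 7 (end label included)
position_slice = demo.iloc[0:2]  # positions 0 and 1 (end position excluded)
print(len(label_slice), len(position_slice))  # 3 2
```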


In [160]:
newdf.head()

Unnamed: 0,A,B,C,D,E
0,669.0,0.916036,0.705687,0.726032,0.923238
1,3850305.0,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.67741,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859


In [161]:
newdf.drop([0])

Unnamed: 0,A,B,C,D,E
1,3850305,0.978838,0.096583,0.36,0.692406
2,0.903961,0.960901,0.677410,0.407416,0.840507
3,0.132554,0.657832,0.603474,0.269729,0.921647
4,0.129319,0.831966,0.814104,0.633965,0.440859
5,0.27971,0.759544,0.765119,0.223099,0.510042
...,...,...,...,...,...
329,0.071551,0.976249,0.936548,0.404837,0.583003
330,0.092559,0.725438,0.708244,0.413599,0.580310
331,0.424742,0.063721,0.921735,0.792344,0.404538
332,0.51622,0.252023,0.500908,0.322258,0.298890


In [162]:
newdf.drop(['A','C'], axis = 1)

Unnamed: 0,B,D,E
0,0.916036,0.726032,0.923238
1,0.978838,0.36,0.692406
2,0.960901,0.407416,0.840507
3,0.657832,0.269729,0.921647
4,0.831966,0.633965,0.440859
...,...,...,...
329,0.976249,0.404837,0.583003
330,0.725438,0.413599,0.580310
331,0.063721,0.792344,0.404538
332,0.252023,0.322258,0.298890


In [163]:
newdf.drop(['A','C'], axis = 1, inplace = True)  # inplace=True modifies newdf itself and returns None

In [164]:
newdf

Unnamed: 0,B,D,E
0,0.916036,0.726032,0.923238
1,0.978838,0.36,0.692406
2,0.960901,0.407416,0.840507
3,0.657832,0.269729,0.921647
4,0.831966,0.633965,0.440859
...,...,...,...
329,0.976249,0.404837,0.583003
330,0.725438,0.413599,0.580310
331,0.063721,0.792344,0.404538
332,0.252023,0.322258,0.298890
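
Without `inplace=True`, `drop` leaves the original frame untouched and returns a new one, so the result has to be stored somewhere. A minimal sketch of both styles, using a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame for illustration
demo = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

trimmed = demo.drop(["A", "C"], axis=1)   # returns a new frame
print(demo.columns.tolist())              # original still has A, B and C
print(trimmed.columns.tolist())           # only B remains

# Reassigning the result has the same effect as inplace=True
demo = demo.drop(columns=["A", "C"])
print(demo.columns.tolist())
```

`drop(columns=[...])` is an equivalent spelling of `drop([...], axis=1)`.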


In [165]:
newdf.head()

Unnamed: 0,B,D,E
0,0.916036,0.726032,0.923238
1,0.978838,0.36,0.692406
2,0.960901,0.407416,0.840507
3,0.657832,0.269729,0.921647
4,0.831966,0.633965,0.440859


In [166]:
newdf.reset_index()

Unnamed: 0,index,B,D,E
0,0,0.916036,0.726032,0.923238
1,1,0.978838,0.36,0.692406
2,2,0.960901,0.407416,0.840507
3,3,0.657832,0.269729,0.921647
4,4,0.831966,0.633965,0.440859
...,...,...,...,...
329,329,0.976249,0.404837,0.583003
330,330,0.725438,0.413599,0.580310
331,331,0.063721,0.792344,0.404538
332,332,0.252023,0.322258,0.298890


In [169]:
newdf.reset_index(drop = True ,inplace = True)

In [170]:
newdf

Unnamed: 0,B,D,E
0,0.916036,0.726032,0.923238
1,0.978838,0.36,0.692406
2,0.960901,0.407416,0.840507
3,0.657832,0.269729,0.921647
4,0.831966,0.633965,0.440859
...,...,...,...
329,0.976249,0.404837,0.583003
330,0.725438,0.413599,0.580310
331,0.063721,0.792344,0.404538
332,0.252023,0.322258,0.298890
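
`reset_index` matters most after rows have been removed, because the remaining labels then have gaps. The sketch below, on a hypothetical frame `demo`, compares the default behavior (old labels kept in a column named `index`) with `drop=True` (old labels discarded):

```python
import pandas as pd

# Hypothetical frame; dropping rows 0 and 2 leaves the labels [1, 3]
demo = pd.DataFrame({"B": [0.9, 0.8, 0.7, 0.6]})
kept = demo.drop([0, 2])

with_old = kept.reset_index()         # old labels move into an 'index' column
clean = kept.reset_index(drop=True)   # old labels are discarded

print(with_old.columns.tolist())      # ['index', 'B']
print(clean.index.tolist())           # [0, 1]
```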


In [171]:
newdf['B'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
329    False
330    False
331    False
332    False
333    False
Name: B, Length: 334, dtype: bool

In [175]:
newdf.loc[:, ['B']] = None  # mark every value in column B as missing

In [176]:
newdf

Unnamed: 0,B,D,E
0,,0.726032,0.923238
1,,0.36,0.692406
2,,0.407416,0.840507
3,,0.269729,0.921647
4,,0.633965,0.440859
...,...,...,...
329,,0.404837,0.583003
330,,0.413599,0.580310
331,,0.792344,0.404538
332,,0.322258,0.298890


In [177]:
newdf["B"].isnull()

0      True
1      True
2      True
3      True
4      True
       ... 
329    True
330    True
331    True
332    True
333    True
Name: B, Length: 334, dtype: bool
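
A long column of True/False values is hard to read, so `isnull()` is often followed by an aggregation. Because True counts as 1, `sum()` gives the number of missing values and `mean()` gives their share. A short sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values in column B
demo = pd.DataFrame({"B": [np.nan, 2.0, np.nan], "D": [1.0, 2.0, 3.0]})

missing_per_column = demo.isnull().sum()    # count of missing values per column
share_missing = demo["B"].isnull().mean()   # fraction of column B that is missing
any_missing = demo.isnull().any().any()     # True if anything at all is missing

print(missing_per_column)
print(share_missing)
print(any_missing)
```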

In [178]:
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
                   "toy" : [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp('1940-04-25'), pd.NaT]})

In [179]:
df

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [180]:
df.dropna()

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [181]:
df.dropna(how = 'all')
# how='all' drops a row only if every value in it is missing; the default, how='any', drops a row with at least one missing value

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [182]:
df = pd.DataFrame({'name': ['Alfred', 'Batman', 'Catwoman'],
                   "toy" : [np.nan, np.nan, np.nan],
                   "born": [pd.NaT, pd.Timestamp('1940-04-25'), pd.NaT]})

In [183]:
df

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,,1940-04-25
2,Catwoman,,NaT


In [185]:
df.dropna(how = 'all', axis = 1)

Unnamed: 0,name,born
0,Alfred,NaT
1,Batman,1940-04-25
2,Catwoman,NaT
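
Besides `how` and `axis`, `dropna` accepts `subset` (check only certain columns) and `thresh` (keep rows with at least that many non-missing values). The sketch below reuses the values of the first Alfred/Batman/Catwoman frame from this section under the hypothetical name `demo`:

```python
import numpy as np
import pandas as pd

# Same values as the first example frame above, under a hypothetical name
demo = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                     "toy": [np.nan, "Batmobile", "Bullwhip"],
                     "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

by_born = demo.dropna(subset=["born"])   # only the 'born' column is checked
at_least_two = demo.dropna(thresh=2)     # keep rows with >= 2 non-missing values

print(by_born.index.tolist())        # [1]
print(at_least_two.index.tolist())   # [1, 2]
```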


In [186]:
f = pd.DataFrame({'name': ['Alfred', 'Batman', 'Alfred'],
                   "toy" : [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp('1940-04-25'), pd.NaT]})

In [187]:
f

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [188]:
f.drop_duplicates()

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [189]:
f.drop_duplicates(subset = ['name'])

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25


In [190]:
f.drop_duplicates(subset = ['name'], keep = 'last')
# the default is keep='first'; keep='last' retains the last occurrence of each duplicate instead

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [191]:
f.drop_duplicates(subset = ['name'], keep = False)

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
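
To inspect which rows `drop_duplicates` would remove before removing them, `duplicated` returns a boolean mask and takes the same `subset` and `keep` arguments. A minimal sketch with a hypothetical frame `demo` modelled on `f`:

```python
import pandas as pd

# Hypothetical frame modelled on f above
demo = pd.DataFrame({"name": ["Alfred", "Batman", "Alfred"],
                     "toy": [None, "Batmobile", "Bullwhip"]})

first_kept = demo.duplicated(subset=["name"])               # default keep='first'
last_kept = demo.duplicated(subset=["name"], keep="last")
all_marked = demo.duplicated(subset=["name"], keep=False)   # every repeated row is flagged

print(first_kept.tolist())   # [False, False, True]
print(last_kept.tolist())    # [True, False, False]
print(all_marked.tolist())   # [True, False, True]
```

Rows marked True are the ones the matching `drop_duplicates` call discards.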


In [192]:
df.shape

(3, 3)

In [194]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    3 non-null      object        
 1   toy     0 non-null      float64       
 2   born    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 200.0+ bytes


In [195]:
df['name'].value_counts(dropna = False)

Alfred      1
Batman      1
Catwoman    1
Name: name, dtype: int64

In [196]:
df['name'].value_counts(dropna = True)

Alfred      1
Batman      1
Catwoman    1
Name: name, dtype: int64

In [198]:
df['toy'].value_counts(dropna = True)

Series([], Name: toy, dtype: int64)

In [199]:
df['toy'].value_counts(dropna = False)

NaN    3
Name: toy, dtype: int64
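
The effect of `dropna` is clearer on a column that mixes real values and missing ones. The sketch below uses a hypothetical Series `toys` and also shows `normalize=True`, which returns shares instead of counts:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
toys = pd.Series(["Batmobile", "Bullwhip", "Batmobile", np.nan])

counts = toys.value_counts()                 # NaN is excluded by default
with_nan = toys.value_counts(dropna=False)   # NaN gets its own row
shares = toys.value_counts(normalize=True)   # fractions of the non-missing values

print(counts)
print(with_nan)
print(shares)
```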

In [200]:
df.isnull()

Unnamed: 0,name,toy,born
0,False,True,True
1,False,True,False
2,False,True,True


In [201]:
df.notnull()

Unnamed: 0,name,toy,born
0,True,False,False
1,True,False,True
2,True,False,False


In [None]:
# data = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Create a DataFrame that contains only integers, with 3 rows and 2 columns, and run the following:


df.describe()

df.mean()

df.corr()

df.count()

df.max()

df.min()

df.median()

df.std()

In [202]:
df = pd.DataFrame({'Column1': [1, 2, 3],
        'Column2': [4, 5, 6]})

In [203]:
df

Unnamed: 0,Column1,Column2
0,1,4
1,2,5
2,3,6


In [205]:
df.index = ["A", "B", "C"]

In [206]:
df

Unnamed: 0,Column1,Column2
A,1,4
B,2,5
C,3,6


In [207]:
df.index = list("XYZ")

In [208]:
df

Unnamed: 0,Column1,Column2
X,1,4
Y,2,5
Z,3,6


In [209]:
df.describe()

Unnamed: 0,Column1,Column2
count,3.0,3.0
mean,2.0,5.0
std,1.0,1.0
min,1.0,4.0
25%,1.5,4.5
50%,2.0,5.0
75%,2.5,5.5
max,3.0,6.0


In [210]:
df.mean()

Column1    2.0
Column2    5.0
dtype: float64

In [211]:
df.corr()

Unnamed: 0,Column1,Column2
Column1,1.0,1.0
Column2,1.0,1.0


In [212]:
df.count()

Column1    3
Column2    3
dtype: int64

In [213]:
df.max()

Column1    3
Column2    6
dtype: int64

In [214]:
df.min()

Column1    1
Column2    4
dtype: int64

In [215]:
df.median()

Column1    2.0
Column2    5.0
dtype: float64

In [216]:
df.std()

Column1    1.0
Column2    1.0
dtype: float64
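
Several of the statistics above can be requested in one call with `agg`. Note also that `std` uses the sample formula (`ddof=1`) by default; passing `ddof=0` gives the population standard deviation. A minimal sketch, rebuilding the same 3x2 frame under the hypothetical name `demo`:

```python
import pandas as pd

# Same values as the exercise frame, under a hypothetical name
demo = pd.DataFrame({"Column1": [1, 2, 3], "Column2": [4, 5, 6]},
                    index=list("XYZ"))

summary = demo.agg(["mean", "median", "std"])   # one row per statistic
pop_std = demo.std(ddof=0)                      # population standard deviation

print(summary)
print(pop_std)
```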