# Pandas  

## What is Pandas?  
- **Pandas** is an open-source data analysis library written in Python.  
- It leverages the power and speed of **NumPy** to make data analysis and preprocessing easy for data scientists.  
- It provides a **rich set of data operations** for loading, selecting, cleaning, and summarizing tabular data.  

## Pandas Data Structures  
Pandas has two main data structures:  

- **Series** → A one-dimensional array with an index. A single column (or row) of a DataFrame is a Series.  
- **DataFrame** → A tabular, spreadsheet-like structure made up of rows and one or more columns.  

### Key Differences:  
- **Series** → A one-dimensional labeled array capable of holding any type of data.  
- **DataFrame** → A two-dimensional labeled data structure with columns of potentially different types of data.  
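
The difference between the two structures can be seen in a minimal sketch. The variable names `marks`, `table`, and `column` are made up for this illustration:

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels
marks = pd.Series([92, 34, 24], index=["a", "b", "c"], name="Marks")

# A DataFrame: two-dimensional, and each column can hold its own type of data
table = pd.DataFrame({"Name": ["Kushal", "Mukesh", "Ashok"], "Marks": [92, 34, 24]})

# Selecting a single column of a DataFrame gives back a Series
column = table["Marks"]
print(type(column).__name__)   # Series
print(marks.ndim, table.ndim)  # 1 2
```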
 


In [1]:
# Import necessary Libraries
import numpy as np
import pandas as pd

In [2]:
# Create a dictionary
dict1 = {
    "Name" : ["Kushal", "Mukesh", "Ashok", "Shailendra"],
    "Marks" : [92, 34, 24, 17],
    "City" : ["Kathmandu", "Dhangadhi", "Dharan", "Chitwan"]
}

In [3]:
df = pd.DataFrame(dict1) # Converts the dictionary to a DataFrame: keys become column names
df

Unnamed: 0,Name,Marks,City
0,Kushal,92,Kathmandu
1,Mukesh,34,Dhangadhi
2,Ashok,24,Dharan
3,Shailendra,17,Chitwan


In [4]:
df.to_csv("friends.csv") # Create a csv file with the data

In [5]:
df.to_csv("no-index-friends.csv", index = False) # Create a csv file without the index column

In [6]:
df.head(2) # Displays the given number of rows from the top

Unnamed: 0,Name,Marks,City
0,Kushal,92,Kathmandu
1,Mukesh,34,Dhangadhi


In [7]:
df.tail(2) # Displays given number of rows from tail side

Unnamed: 0,Name,Marks,City
2,Ashok,24,Dharan
3,Shailendra,17,Chitwan


In [8]:
df.describe() # Summary statistics for all the numerical columns

Unnamed: 0,Marks
count,4.0
mean,41.75
std,34.21866
min,17.0
25%,22.25
50%,29.0
75%,48.5
max,92.0


In [9]:
friends = pd.read_csv("friends.csv") # The saved index comes back as an extra "Unnamed: 0" column, so pass index = False when saving the csv
friends

Unnamed: 0.1,Unnamed: 0,Name,Marks,City
0,0,Kushal,92,Kathmandu
1,1,Mukesh,34,Dhangadhi
2,2,Ashok,24,Dharan
3,3,Shailendra,17,Chitwan


In [10]:
friends = pd.read_csv("no-index-friends.csv") # This file was saved without the index, so no extra column appears
friends

Unnamed: 0,Name,Marks,City
0,Kushal,92,Kathmandu
1,Mukesh,34,Dhangadhi
2,Ashok,24,Dharan
3,Shailendra,17,Chitwan
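
If a CSV file has already been written with its index, another option is to tell `read_csv` which column holds it. The sketch below writes to an in-memory `io.StringIO` buffer instead of a real file, purely so that it is self-contained. The names `buffer`, `naive`, and `restored` are assumptions for the example:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Kushal", "Mukesh"], "Marks": [92, 34]})

# Write with the index (the default) into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer)

# A plain read_csv treats the saved index as data and names it "Unnamed: 0"
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 tells read_csv to use the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))     # ['Unnamed: 0', 'Name', 'Marks']
print(list(restored.columns))  # ['Name', 'Marks']
```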


In [11]:
type(friends) # It displays the type of the objects

pandas.core.frame.DataFrame

In [12]:
friends.dtypes # Shows the datatypes of all the columns in data frame

Name     object
Marks     int64
City     object
dtype: object

In [13]:
# Changing values in the data frame
friends["Marks"] # Provides only the selected column data

0    92
1    34
2    24
3    17
Name: Marks, dtype: int64

In [14]:
friends["Marks"][0] # Provides only the data in selected index of selected column

np.int64(92)

In [15]:
# You can assign to that position, but this is chained indexing: pandas cannot
# guarantee whether friends["Marks"] is a view or a copy, so .loc is the safer way to assign
friends["Marks"][0] = 98

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  friends["Marks"][0] = 98


In [16]:
# Now we can see that the value has been changed
friends

Unnamed: 0,Name,Marks,City
0,Kushal,98,Kathmandu
1,Mukesh,34,Dhangadhi
2,Ashok,24,Dharan
3,Shailendra,17,Chitwan
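
The warning above goes away when the row and the column are given in a single indexing step. A minimal sketch, using a shortened version of the `friends` table:

```python
import pandas as pd

friends = pd.DataFrame({"Name": ["Kushal", "Mukesh"], "Marks": [92, 34]})

# One indexing step: row label 0, column label "Marks"
friends.loc[0, "Marks"] = 98

# .at does the same job for a single scalar value
friends.at[1, "Marks"] = 40

print(friends["Marks"].tolist())  # [98, 40]
```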


In [17]:
# You can write the result back to the csv file or create a new csv file
# Use the same file name to overwrite the existing file, or a different name to create a new one
friends.to_csv("no-index-friends.csv", index = False) # index = False keeps the index from being written as an extra column

In [18]:
# Changing the index of the data frame
friends.index = ["first", "second", "third", "fourth"]
friends

Unnamed: 0,Name,Marks,City
first,Kushal,98,Kathmandu
second,Mukesh,34,Dhangadhi
third,Ashok,24,Dharan
fourth,Shailendra,17,Chitwan


In [19]:
ser = pd.Series(np.random.rand(34)) # Create a series with random numbers
ser

0     0.255537
1     0.488719
2     0.170599
3     0.218206
4     0.710519
5     0.234675
6     0.178419
7     0.935754
8     0.727654
9     0.320553
10    0.782689
11    0.003208
12    0.846739
13    0.739438
14    0.638278
15    0.546133
16    0.787092
17    0.024155
18    0.262551
19    0.366481
20    0.886248
21    0.998307
22    0.738574
23    0.858650
24    0.038089
25    0.466997
26    0.232350
27    0.449926
28    0.094679
29    0.253393
30    0.648286
31    0.341548
32    0.291772
33    0.861827
dtype: float64

In [20]:
new_df = pd.DataFrame(np.random.rand(334, 5), index = np.arange(334)) # Create a Data Frame with random numbers
new_df

Unnamed: 0,0,1,2,3,4
0,0.889041,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.681220,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762
...,...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416,0.518274
330,0.331857,0.542560,0.172620,0.229618,0.079808
331,0.439550,0.170010,0.164840,0.798853,0.582940
332,0.266609,0.102856,0.686420,0.747656,0.194007


In [21]:
new_df.index # Shows the row index labels of the data frame

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
      dtype='int64', length=334)

In [22]:
new_df.columns # Show all the columns in the data frame

RangeIndex(start=0, stop=5, step=1)

In [23]:
new_df.to_numpy() # Converts the data frame to a NumPy array

array([[0.88904093, 0.20667926, 0.06661144, 0.30830289, 0.9703634 ],
       [0.62048721, 0.13351493, 0.73645183, 0.27352139, 0.49408699],
       [0.68122035, 0.82101314, 0.878545  , 0.08329666, 0.36185822],
       ...,
       [0.43954977, 0.17000955, 0.16484048, 0.79885272, 0.58294009],
       [0.26660887, 0.10285629, 0.6864197 , 0.74765566, 0.19400698],
       [0.04357519, 0.26770482, 0.91097745, 0.1369159 , 0.37731174]],
      shape=(334, 5))

In [24]:
new_df.T # Transposes the data frame (rows become columns), like a matrix transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,324,325,326,327,328,329,330,331,332,333
0,0.889041,0.620487,0.68122,0.212505,0.993227,0.357507,0.697967,0.597865,0.379402,0.139876,...,0.185819,0.212955,0.509602,0.324553,0.820533,0.455688,0.331857,0.43955,0.266609,0.043575
1,0.206679,0.133515,0.821013,0.074566,0.119845,0.749749,0.03815,0.73651,0.67352,0.813827,...,0.224821,0.865151,0.875064,0.738586,0.420664,0.092123,0.54256,0.17001,0.102856,0.267705
2,0.066611,0.736452,0.878545,0.427335,0.975493,0.138524,0.917317,0.803868,0.418368,0.43754,...,0.758746,0.103733,0.792656,0.199311,0.841621,0.646842,0.17262,0.16484,0.68642,0.910977
3,0.308303,0.273521,0.083297,0.099088,0.383383,0.852122,0.938535,0.193152,0.837051,0.427148,...,0.638466,0.00545,0.403342,0.83119,0.589492,0.186416,0.229618,0.798853,0.747656,0.136916
4,0.970363,0.494087,0.361858,0.887688,0.508762,0.583829,0.319381,0.622917,0.678862,0.512652,...,0.253936,0.997019,0.166855,0.224726,0.593415,0.518274,0.079808,0.58294,0.194007,0.377312


In [25]:
new_df.sort_index(axis = 0, ascending=False) # Sort the data frame by its index labels: axis = 0 sorts the rows, axis = 1 sorts the columns

Unnamed: 0,0,1,2,3,4
333,0.043575,0.267705,0.910977,0.136916,0.377312
332,0.266609,0.102856,0.686420,0.747656,0.194007
331,0.439550,0.170010,0.164840,0.798853,0.582940
330,0.331857,0.542560,0.172620,0.229618,0.079808
329,0.455688,0.092123,0.646842,0.186416,0.518274
...,...,...,...,...,...
4,0.993227,0.119845,0.975493,0.383383,0.508762
3,0.212505,0.074566,0.427335,0.099088,0.887688
2,0.681220,0.821013,0.878545,0.083297,0.361858
1,0.620487,0.133515,0.736452,0.273521,0.494087
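
`sort_index` orders by the labels, not by the data. To order by the values in a column, `sort_values` is the counterpart. A small sketch with made-up labels:

```python
import pandas as pd

scores = pd.DataFrame({"Marks": [92, 17, 34]}, index=["b", "c", "a"])

# Sort rows by their index labels
by_label = scores.sort_index()

# Sort rows by the values in the "Marks" column, largest first
by_value = scores.sort_values("Marks", ascending=False)

print(by_label.index.tolist())  # ['a', 'b', 'c']
print(by_value.index.tolist())  # ['b', 'a', 'c']
```

Both calls return a new DataFrame and leave `scores` unchanged.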


In [26]:
type(new_df[0]) # Each column of a DataFrame is a Series

pandas.core.series.Series

In [27]:
new_df2 = new_df # new_df2 is not a copy, just a second name for the same DataFrame object, so a change made through either name shows up in both
new_df2[0][0] = 0
new_df # The element in new_df has changed too

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  new_df2[0][0] = 0


Unnamed: 0,0,1,2,3,4
0,0.000000,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.681220,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762
...,...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416,0.518274
330,0.331857,0.542560,0.172620,0.229618,0.079808
331,0.439550,0.170010,0.164840,0.798853,0.582940
332,0.266609,0.102856,0.686420,0.747656,0.194007


In [28]:
new_df3 = new_df.copy() # Creates a copy. Now changes will not affect new_df
new_df3[0][0] = 0.5
new_df # No change in original while changing copy

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  new_df3[0][0] = 0.5


Unnamed: 0,0,1,2,3,4
0,0.000000,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.681220,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762
...,...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416,0.518274
330,0.331857,0.542560,0.172620,0.229618,0.079808
331,0.439550,0.170010,0.164840,0.798853,0.582940
332,0.266609,0.102856,0.686420,0.747656,0.194007
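
The difference between plain assignment and `.copy()` can be checked without chained assignment. In this sketch the names `original`, `alias`, and `independent` are made up for the illustration:

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

alias.loc[0, "A"] = 0.0         # visible through `original`
independent.loc[1, "B"] = 99.0  # not visible through `original`

print(alias is original)     # True
print(original.loc[0, "A"])  # 0.0
print(original.loc[1, "B"])  # 4.0
```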


In [29]:
new_df.loc[0,0] = 654 # Using the .loc indexer is the proper way to change values in a Data Frame
new_df.head()

Unnamed: 0,0,1,2,3,4
0,654.0,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.68122,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762


In [30]:
new_df.columns = list("ABCDE") # Change columns name to A B C D E
new_df

Unnamed: 0,A,B,C,D,E
0,654.000000,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.681220,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762
...,...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416,0.518274
330,0.331857,0.542560,0.172620,0.229618,0.079808
331,0.439550,0.170010,0.164840,0.798853,0.582940
332,0.266609,0.102856,0.686420,0.747656,0.194007


In [31]:
#new_df.loc[0, 0] = 100 # Column 0 no longer exists after the rename, so this would create a new column 0 with NaN in every other row
new_df.loc[0, "A"] = 90
new_df

Unnamed: 0,A,B,C,D,E
0,90.000000,0.206679,0.066611,0.308303,0.970363
1,0.620487,0.133515,0.736452,0.273521,0.494087
2,0.681220,0.821013,0.878545,0.083297,0.361858
3,0.212505,0.074566,0.427335,0.099088,0.887688
4,0.993227,0.119845,0.975493,0.383383,0.508762
...,...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416,0.518274
330,0.331857,0.542560,0.172620,0.229618,0.079808
331,0.439550,0.170010,0.164840,0.798853,0.582940
332,0.266609,0.102856,0.686420,0.747656,0.194007
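
The commented-out line above describes what pandas calls setting with enlargement: assigning through `.loc` to a column label that does not exist adds that column. A minimal sketch, with a hypothetical column name `"Z"`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

# "Z" is not an existing column, so .loc adds it;
# rows that were not assigned are filled with NaN
frame.loc[0, "Z"] = 100.0

print(list(frame.columns))      # ['A', 'Z']
print(frame["Z"].isna().sum())  # 2
```

This is why a typo in a column label can silently add a column instead of raising an error.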


In [32]:
new_df = new_df.drop("E", axis = 1) # Remove E column
# Always remember axis = 0 for row and axis = 1 for column
new_df

Unnamed: 0,A,B,C,D
0,90.000000,0.206679,0.066611,0.308303
1,0.620487,0.133515,0.736452,0.273521
2,0.681220,0.821013,0.878545,0.083297
3,0.212505,0.074566,0.427335,0.099088
4,0.993227,0.119845,0.975493,0.383383
...,...,...,...,...
329,0.455688,0.092123,0.646842,0.186416
330,0.331857,0.542560,0.172620,0.229618
331,0.439550,0.170010,0.164840,0.798853
332,0.266609,0.102856,0.686420,0.747656


In [33]:
new_df.loc[[1, 2], ["C", "D"]] # Locate the selected rows and columns

Unnamed: 0,C,D
1,0.736452,0.273521
2,0.878545,0.083297


In [34]:
# If we need all rows or all columns, we use : in the required place
new_df.loc[[1, 2], :]

Unnamed: 0,A,B,C,D
1,0.620487,0.133515,0.736452,0.273521
2,0.68122,0.821013,0.878545,0.083297


In [35]:
new_df.loc[(new_df["A"] > 0.3)] # Locate the rows in which column A has value > 0.3

Unnamed: 0,A,B,C,D
0,90.000000,0.206679,0.066611,0.308303
1,0.620487,0.133515,0.736452,0.273521
2,0.681220,0.821013,0.878545,0.083297
4,0.993227,0.119845,0.975493,0.383383
5,0.357507,0.749749,0.138524,0.852122
...,...,...,...,...
327,0.324553,0.738586,0.199311,0.831190
328,0.820533,0.420664,0.841621,0.589492
329,0.455688,0.092123,0.646842,0.186416
330,0.331857,0.542560,0.172620,0.229618


In [36]:
new_df.loc[(new_df["A"] > 0.3) & (new_df["C"] > 0.1)] # Locate the rows in which column A has value > 0.3 and column C has value > 0.1

Unnamed: 0,A,B,C,D
1,0.620487,0.133515,0.736452,0.273521
2,0.681220,0.821013,0.878545,0.083297
4,0.993227,0.119845,0.975493,0.383383
5,0.357507,0.749749,0.138524,0.852122
6,0.697967,0.038150,0.917317,0.938535
...,...,...,...,...
327,0.324553,0.738586,0.199311,0.831190
328,0.820533,0.420664,0.841621,0.589492
329,0.455688,0.092123,0.646842,0.186416
330,0.331857,0.542560,0.172620,0.229618
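
Conditions are combined with `&` (and), `|` (or), and `~` (not), and each condition needs its own parentheses. The sketch below uses a tiny hand-written table so the result can be checked by eye:

```python
import pandas as pd

data = pd.DataFrame({"A": [0.2, 0.5, 0.9, 0.4], "C": [0.05, 0.3, 0.02, 0.8]})

# Rows where both conditions hold
both = data.loc[(data["A"] > 0.3) & (data["C"] > 0.1)]

# Rows where at least one condition holds
either = data.loc[(data["A"] > 0.3) | (data["C"] > 0.1)]

# Rows where the condition does not hold
not_a = data.loc[~(data["A"] > 0.3)]

print(both.index.tolist())    # [1, 3]
print(either.index.tolist())  # [1, 2, 3]
print(not_a.index.tolist())   # [0]
```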


In [37]:
new_df.iloc[0, 3] # iloc takes integer positions [i, j], just like in matrices

np.float64(0.30830289089866436)

In [38]:
new_df.loc[0, "D"] # loc takes labels [row, column] in the Data Frame

np.float64(0.30830289089866436)

In [39]:
# iloc can select the same data as loc, but it takes integer positions [i, j] instead of labels [row, column]
new_df.iloc[[0, 5], [1, 2]]

Unnamed: 0,B,C
0,0.206679,0.066611
5,0.749749,0.138524
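
`loc` and `iloc` only look interchangeable while the row labels happen to be 0, 1, 2, and so on. A sketch with a deliberately different index (labels 5, 6, 7, chosen for illustration) shows where they part ways:

```python
import pandas as pd

frame = pd.DataFrame({"B": [10, 20, 30], "C": [1, 2, 3]}, index=[5, 6, 7])

by_label = frame.loc[5, "C"]     # row labelled 5, column labelled "C"
by_position = frame.iloc[0, 1]   # first row, second column

# Slices differ too: loc includes the end label, iloc excludes the end position
label_slice = frame.loc[5:6]
position_slice = frame.iloc[0:1]

print(by_label, by_position)                  # 1 1
print(len(label_slice), len(position_slice))  # 2 1
```

Here `frame.loc[0, "C"]` would raise a `KeyError`, because no row carries the label 0.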


In [40]:
new_df.drop(["A", "C"], axis = 1) # Delete the selected columns as axis = 1 (Returns a copy)
# Remember that this will not change the original Data Frame unless we save the result, like: new_df = ....

Unnamed: 0,B,D
0,0.206679,0.308303
1,0.133515,0.273521
2,0.821013,0.083297
3,0.074566,0.099088
4,0.119845,0.383383
...,...,...
329,0.092123,0.186416
330,0.542560,0.229618
331,0.170010,0.798853
332,0.102856,0.747656
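
That `drop` leaves the original untouched can be verified directly. The sketch also uses the `columns=` and `index=` keywords, which some readers find easier to remember than `axis`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop returns a new DataFrame; `frame` keeps all of its columns
trimmed = frame.drop(columns=["A", "C"])  # same as frame.drop(["A", "C"], axis=1)
fewer_rows = frame.drop(index=[0])        # same as frame.drop([0], axis=0)

print(list(frame.columns))       # ['A', 'B', 'C']
print(list(trimmed.columns))     # ['B']
print(fewer_rows.index.tolist()) # [1]
```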


In [41]:
# If we use inplace=True it will change the original Data Frame
new_df.drop(["C", "D"], axis = 1, inplace=True)

In [42]:
new_df

Unnamed: 0,A,B
0,90.000000,0.206679
1,0.620487,0.133515
2,0.681220,0.821013
3,0.212505,0.074566
4,0.993227,0.119845
...,...,...
329,0.455688,0.092123
330,0.331857,0.542560
331,0.439550,0.170010
332,0.266609,0.102856


In [43]:
# Let's remove some rows from the middle
new_df.drop(list(range(6, 330)), axis=0, inplace=True)

In [44]:
new_df

Unnamed: 0,A,B
0,90.0,0.206679
1,0.620487,0.133515
2,0.68122,0.821013
3,0.212505,0.074566
4,0.993227,0.119845
5,0.357507,0.749749
330,0.331857,0.54256
331,0.43955,0.17001
332,0.266609,0.102856
333,0.043575,0.267705


In [45]:
# We need to reset the index now
# new_df.reset_index() # This will add a new column named index so drop=True is necessary
# new_df.reset_index(drop=True) #This will not change the original Data Frame so inplace=True is required
new_df.reset_index(drop=True, inplace=True)

In [46]:
new_df

Unnamed: 0,A,B
0,90.0,0.206679
1,0.620487,0.133515
2,0.68122,0.821013
3,0.212505,0.074566
4,0.993227,0.119845
5,0.357507,0.749749
6,0.331857,0.54256
7,0.43955,0.17001
8,0.266609,0.102856
9,0.043575,0.267705
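
The effect of `drop=True` is easier to see on a small frame. In this sketch, `kept` and `dropped` are names made up for the two variants:

```python
import pandas as pd

# Remove the middle rows so the index has a gap: labels 0 and 3 remain
frame = pd.DataFrame({"A": [1, 2, 3, 4]}).drop(index=[1, 2])

# Without drop=True the old labels are kept as a new column named "index"
kept = frame.reset_index()

# With drop=True the old labels are discarded
dropped = frame.reset_index(drop=True)

print(list(kept.columns))      # ['index', 'A']
print(kept["index"].tolist())  # [0, 3]
print(dropped.index.tolist())  # [0, 1]
```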


In [47]:
new_df["B"].isnull() # Returns True where the value is null and False otherwise

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: B, dtype: bool

In [48]:
# new_df["A"] = None # Also makes all the values in column A null
new_df.loc[:, "A"] = None # Preferred: one .loc assignment makes all the values in column A null



In [49]:
new_df

Unnamed: 0,A,B
0,,0.206679
1,,0.133515
2,,0.821013
3,,0.074566
4,,0.119845
5,,0.749749
6,,0.54256
7,,0.17001
8,,0.102856
9,,0.267705


In [50]:
new_df["B"].isnull() # Column B is unaffected: only the values in column A were changed to null

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: B, dtype: bool
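
Instead of reading a column of True/False values, the missing values can be counted per column. A minimal sketch on a small hand-written frame (`isna` is an alias of `isnull`):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [np.nan, np.nan, np.nan], "B": [0.2, np.nan, 0.8]})

# Number of missing values in each column
missing_per_column = frame.isna().sum()

# True for columns in which every value is missing
all_missing = frame.isna().all()

print(missing_per_column.tolist())  # [3, 1]
print(all_missing.tolist())         # [True, False]
```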

In [51]:
df_new = pd.DataFrame( {"name" : ['Alfred', 'Batman', 'Catwoman'],
                    "toy" : [np.nan, 'Batmobile', 'Bullwhip'],
                    "born" : [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
df_new

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [52]:
df_new.dropna() # Removes every row that contains at least one missing value (defaults: axis=0, how="any")

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [53]:
df_new.dropna(how="all") # Removes a row only if all of its values are missing (default axis=0)

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [54]:
# You can also choose the axis: axis=1 removes every column that contains a missing value
df_new.dropna(axis=1)

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman
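
`dropna` can also be limited to certain columns or to a minimum number of non-missing values. A sketch using the `subset` and `thresh` parameters on the same table:

```python
import numpy as np
import pandas as pd

df_new = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT],
})

# Only look at the "toy" column when deciding which rows to remove
with_toy = df_new.dropna(subset=["toy"])

# Keep rows that have at least 2 non-missing values
at_least_two = df_new.dropna(thresh=2)

print(with_toy.index.tolist())      # [1, 2]
print(at_least_two.index.tolist())  # [1, 2]
```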


In [55]:
# Let's make some duplicates in the Data Frame
df_new.loc[2, "name"] = "Alfred"
df_new

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [56]:
# Let's remove the duplicates now
df_new.drop_duplicates(subset=['name'], keep= 'last')
# keep = 'first' (the default) keeps the first occurrence and drops the rest
# keep = 'last' keeps the last occurrence and drops the rest
# keep = False drops every row that has a duplicate

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT
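
The three `keep` options can be compared side by side. The sketch rebuilds a table like the one above and also uses `duplicated`, which marks the repeated rows without removing them:

```python
import numpy as np
import pandas as pd

toys = pd.DataFrame({
    "name": ["Alfred", "Batman", "Alfred"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
})

first = toys.drop_duplicates(subset=["name"])                   # default keep="first"
last = toys.drop_duplicates(subset=["name"], keep="last")
no_repeats = toys.drop_duplicates(subset=["name"], keep=False)  # drop all repeated names

# duplicated() marks later occurrences as True
flags = toys.duplicated(subset=["name"])

print(first.index.tolist())       # [0, 1]
print(last.index.tolist())        # [1, 2]
print(no_repeats.index.tolist())  # [1]
print(flags.tolist())             # [False, False, True]
```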


In [57]:
df_new.shape # Provides information about number of rows and columns

(3, 3)

In [58]:
df_new.info() # Displays information about the Data Frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    3 non-null      object        
 1   toy     2 non-null      object        
 2   born    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 204.0+ bytes


In [59]:
df_new['toy'].value_counts(dropna=False) # Counts how often each value occurs in the column
# dropna=False includes NaN in the counts
# dropna=True (the default) leaves NaN out of the counts

toy
NaN          1
Batmobile    1
Bullwhip     1
Name: count, dtype: int64
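
The effect of `dropna` shows up in the totals. A small sketch on a hand-written Series in which one value is repeated:

```python
import numpy as np
import pandas as pd

toy = pd.Series(["Batmobile", np.nan, "Bullwhip", "Batmobile"])

without_nan = toy.value_counts()             # NaN is left out by default
with_nan = toy.value_counts(dropna=False)    # NaN gets its own row

print(without_nan["Batmobile"])  # 2
print(without_nan.sum())         # 3
print(with_nan.sum())            # 4
```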

In [60]:
df_new.notnull() # Returns True where a value is not null and False where it is

Unnamed: 0,name,toy,born
0,True,False,False
1,True,True,True
2,True,True,False
