## <span style='color:orange;'>Reindexing and altering labels   </span>
 


In [3]:
import pandas as pd
import numpy as np




In [4]:
series_var = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e', 's'])  # pair 6 random numbers from NumPy with the given index labels
print(series_var)
# Example output (the values change on every run):
# a    0.058215
# b   -0.339988
# c    0.011333
# d    0.887064
# e    0.059441
# s   -1.172545
# dtype: float64
series_var['a']  # returns the value stored under label 'a' (0.058215 in the example above)





a    0.661664
b   -0.381765
c    0.656258
d    0.591415
e    1.525307
s   -1.006181
dtype: float64


0.6616636176809699

<div class='alert alert-block alert-info'>
<b>1. np.random.randn(6):</b>
This part uses the numpy.random module to generate a NumPy array of 6 random numbers.
randn specifically draws random values from a standard normal distribution (mean 0, variance 1).
So, these 6 numbers represent random floating-point values.
 </div>

<div class='alert alert-block alert-info'>
<b>2. pd.Series(...):
</b>This part uses the pandas library to create a Series object.
A Series is essentially a one-dimensional labeled array, similar to a list or NumPy array, but with labels (indexes) attached to each element.
</div>

- Here, the index argument is a list of 6 strings: 'a', 'b', 'c', 'd', 'e', 's'.

- Each label is paired, in order, with the corresponding random value in the NumPy array.
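
As a reproducible variant of the cell above, the sketch below swaps the random numbers for fixed values (an assumption made only so the result is the same on every run) and shows that each label retrieves the value paired with it.

```python
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible.
values = [1.5, -0.3, 0.0, 2.25, 0.75, -1.0]
labels = ['a', 'b', 'c', 'd', 'e', 's']

series_fixed = pd.Series(values, index=labels)

# Label-based lookup returns the scalar paired with that label.
first_value = series_fixed['a']
last_value = series_fixed['s']
print(first_value, last_value)
```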

In [5]:
reindex_var = series_var.reindex(['e', 'b', 'f', 'd','g'])
reindex_var

e    1.525307
b   -0.381765
f         NaN
d    0.591415
g         NaN
dtype: float64

<div class='alert alert-block alert-info'>
<b>Reindexing</b> reorders the existing values <b>to match the labels given in the list</b>.
 Any label inside [ ] that has no corresponding value in the original Series is assigned <b>NaN</b> by default. </div>
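
The default NaN can be replaced at reindex time. The sketch below uses a small Series with made-up values (series_small is a hypothetical name, not part of the notebook) and the fill_value parameter of reindex.

```python
import pandas as pd

series_small = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# 'z' is not in the original index, so it receives NaN by default.
with_nan = series_small.reindex(['c', 'a', 'z'])

# fill_value replaces the default NaN for labels that had no match.
with_zero = series_small.reindex(['c', 'a', 'z'], fill_value=0.0)

print(with_nan)
print(with_zero)
```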

In [6]:
df=pd.read_csv('./data.csv')
df

Unnamed: 0,Code,Age_single_years,Census_night_population_count,Census_usually_resident_population_count
0,000,Less than one year,58665,58158
1,001,One year,58356,58020
2,002,Two years,59013,58719
3,003,Three years,60279,59970
4,004,Four years,60348,60054
...,...,...,...,...
117,117,117 years,0,0
118,118,118 years,0,0
119,119,119 years,0,0
120,120,120 years and over,0,0


<div class='alert alert-block alert-info'>
<b>read_csv</b> is the pandas function that reads data from a comma-separated values (CSV) file into a DataFrame.</div>

<div class='alert alert-block alert-warning'>
<b>The file location must be given accurately.</b> Otherwise pandas raises an error like

<b> <i>FileNotFoundError: [Errno 2] No such file or directory: '[your_path].csv'
</i></b>
</div>
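
One way to avoid the error is to check the path before reading. This is a minimal sketch, not the notebook's own code: load_csv is a hypothetical helper, and the small CSV is written to a temporary directory only so the example runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical file written to a temporary directory so the sketch is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'data.csv'
csv_path.write_text('Code,Age_single_years\n000,Less than one year\n001,One year\n')


def load_csv(path):
    """Return a DataFrame, or None when the file does not exist."""
    path = Path(path)
    if not path.exists():
        print(f'File not found: {path.resolve()}')
        return None
    return pd.read_csv(path)


loaded = load_csv(csv_path)                        # file exists, returns a DataFrame
missing = load_csv(tmp_dir / 'no_such_file.csv')   # file is absent, returns None
```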

In [7]:
df.reindex(index=[1, 3,4], columns=['Age_single_years', 'Census_night_population_count', 'Census_usually_resident_population_count'])

Unnamed: 0,Age_single_years,Census_night_population_count,Census_usually_resident_population_count
1,One year,58356,58020
3,Three years,60279,59970
4,Four years,60348,60054


 <span style='color:orange; font-size:30px'> Reindexing to align with another object </span>



In [55]:
df2 = df[1:3]
df2

Unnamed: 0,one,two,three
b,2.095898,0.78208,1.905351
c,-0.386739,0.948849,0.602617
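
The cell above only slices the frame. A common next step for this topic is reindex_like, which reindexes one object to the labels of another. The sketch below assumes that is the intended follow-up and uses made-up data (frame and template are hypothetical names).

```python
import pandas as pd

frame = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, 4.0], 'two': [10.0, 20.0, 30.0, 40.0]},
    index=['a', 'b', 'c', 'd'],
)

# A smaller object whose row and column labels act as the template.
template = frame.iloc[1:3][['two']]

# reindex_like keeps only the labels found in the template.
aligned = frame.reindex_like(template)
print(aligned)
```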


<span style='color:orange;font-size:30px'>Aligning objects with each other with align
  </span>



<div class='alert alert-block alert-info'>
<b>The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):</b> </div>

 *  join='outer': take the union of the indexes (default)
 *  join='left': use the calling object’s index
 *  join='right': use the passed object’s index
 *  join='inner': intersect the indexes 
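
The four join options can be compared side by side. The sketch below uses two small Series with fixed, made-up values (left and right are hypothetical names) and records the index that each join mode produces.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([30.0, 40.0, 50.0], index=['c', 'd', 'e'])

# Collect the resulting index for each join mode.
index_by_join = {}
for how in ['outer', 'left', 'right', 'inner']:
    left_aligned, right_aligned = left.align(right, join=how)
    index_by_join[how] = list(left_aligned.index)
    print(how, index_by_join[how])
```

Both returned objects always share the same index, so only the index of the first one is stored.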


In [9]:
series_var= pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
series_var

a    0.276017
b    0.263798
c   -0.125714
d   -0.178770
e   -0.032935
dtype: float64

In [57]:
s1 = series_var[0:3]
s1

a    0.276017
b    0.263798
c   -0.125714
dtype: float64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b>The slice [0:3] includes position 0 but excludes position 3.</b> </div>

In [59]:
s2 = series_var[2:]
s2

c   -0.125714
d   -0.178770
e   -0.032935
dtype: float64

<div class='alert alert-block alert-info'>
<b>Starts from 2 and goes till end</b> </div>

In [61]:
s1.align(s2)


(a    0.276017
 b    0.263798
 c   -0.125714
 d         NaN
 e         NaN
 dtype: float64,
 a         NaN
 b         NaN
 c   -0.125714
 d   -0.178770
 e   -0.032935
 dtype: float64)

<div class='alert alert-block alert-info'>
<b>The align method in pandas is used to align two objects along a particular axis. It returns two aligned Series, with missing values filled with NaN where necessary.</b> </div>

In [13]:
s1.align(s2, join='inner')


(c   -0.125714
 dtype: float64,
 c   -0.125714
 dtype: float64)

<div class='alert alert-block alert-info'>
<b>With join='inner', the result contains only those index labels present in both s1 and s2.
</b> </div>

In [62]:
s1.align(s2, join='left')

(a    0.276017
 b    0.263798
 c   -0.125714
 dtype: float64,
 a         NaN
 b         NaN
 c   -0.125714
 dtype: float64)

In [15]:
s1

a    0.276017
b    0.263798
c   -0.125714
dtype: float64

In [16]:
s2

c   -0.125714
d   -0.178770
e   -0.032935
dtype: float64

In [17]:
df = pd.DataFrame({
    'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])
    })
df

Unnamed: 0,one,two,three
a,-1.750722,1.021922,
b,-1.150935,0.029617,-0.645406
c,1.752288,0.154089,-0.982166
d,,0.368516,-0.207255


In [18]:
df2 = pd.DataFrame(
    {'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'e']),
   'three' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd'])}
   )
df2

Unnamed: 0,two,three
a,-1.496773,1.101906
b,-0.353832,0.948401
c,0.293855,-1.123875
d,,-1.096554
e,0.202505,


In [19]:
df.align(df2, join='inner')

(        two     three
 a  1.021922       NaN
 b  0.029617 -0.645406
 c  0.154089 -0.982166
 d  0.368516 -0.207255,
         two     three
 a -1.496773  1.101906
 b -0.353832  0.948401
 c  0.293855 -1.123875
 d       NaN -1.096554)

In [20]:
df.align(df2, join='inner', axis=0)

(        one       two     three
 a -1.750722  1.021922       NaN
 b -1.150935  0.029617 -0.645406
 c  1.752288  0.154089 -0.982166
 d       NaN  0.368516 -0.207255,
         two     three
 a -1.496773  1.101906
 b -0.353832  0.948401
 c  0.293855 -1.123875
 d       NaN -1.096554)

## <span style='color:orange;font-size:30px'> Filling while reindexing </span>


<div class='alert alert-block alert-info'>
<b>reindex()</b>  takes an optional parameter method which is a filling method chosen from the following options:</div>

-  <b>  pad / ffill: Fill values forward  </b>

-  <b>  bfill / backfill: Fill values backward </b>

-  <b> nearest: Fill from the nearest index value </b>
 
 

<span style='color:pink;font-size:20px'>date_range Usecase  </span>


<div class='alert alert-block alert-info'>
<b>pd.date_range</b> 

- A pandas function for generating sequences of dates at a fixed frequency.

- It returns a <b> DatetimeIndex </b>, an index object that holds the generated timestamps.

- <b> Flexibility in specifying dates: You can define the range in several ways: </b>

- -     Start and end dates
- -     Periods: the number of dates you want to generate
- -     Frequency: the time interval between consecutive dates in the sequence




</div>


In [21]:
date_range = pd.date_range(start='2024-01-01', end='2024-02-01') #withStartAndEnd
date_range



DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
               '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12',
               '2024-01-13', '2024-01-14', '2024-01-15', '2024-01-16',
               '2024-01-17', '2024-01-18', '2024-01-19', '2024-01-20',
               '2024-01-21', '2024-01-22', '2024-01-23', '2024-01-24',
               '2024-01-25', '2024-01-26', '2024-01-27', '2024-01-28',
               '2024-01-29', '2024-01-30', '2024-01-31', '2024-02-01'],
              dtype='datetime64[ns]', freq='D')

In [22]:
rng = pd.date_range('1/3/2023', periods=4) #withPeriods
rng

DatetimeIndex(['2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06'], dtype='datetime64[ns]', freq='D')

In [23]:
# Exercise: try the freq argument yourself
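
One possible answer for the frequency case is sketched below. The choice of freq='2D' (every two days) is only an illustration; any valid pandas frequency string could be used instead.

```python
import pandas as pd

# freq='2D' asks for one timestamp every two days.
every_other_day = pd.date_range(start='2023-01-03', periods=4, freq='2D')
print(every_other_day)
```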

In [24]:
ts = pd.Series(np.random.randn(4), index=rng)
ts

2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-05    0.127656
2023-01-06   -1.214498
Freq: D, dtype: float64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b> Creating a Series: pd.Series(1st argument, 2nd argument)</b>

- Creates a pandas Series object. Inside the parentheses, we provide:

    -  <b>     1st argument</b>   The data: in this case, the array of four random numbers.

    -  <b>     2nd argument</b> The index: the DatetimeIndex object created above and stored in rng. </div>

In [25]:
ts2 = ts.iloc[[0, 2, 2, 2]]  # select rows by position; position 2 is repeated three times
ts2


2023-01-03   -0.668364
2023-01-05    0.127656
2023-01-05    0.127656
2023-01-05    0.127656
dtype: float64

<div class='alert alert-block alert-info'>
<b>

Creates a new Series named ts2 by selecting specific elements from the original Series ts</b>
-  The 4 positions [0,2,2,2] give 4 results, each with its corresponding DatetimeIndex label
-  Position 2 is listed three times, so the row for 2023-01-05 appears three times; the dates in the result follow the positions inside [ ]

 </div>

In [26]:
ts2 = ts.iloc[[0, 1, 3]]  # select rows by position, skipping position 2 (2023-01-05)
ts2


2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-06   -1.214498
dtype: float64

In [27]:
ts2.reindex(ts.index)

2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-05         NaN
2023-01-06   -1.214498
Freq: D, dtype: float64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b>Reindexing looks up each date of the new index (ts.index) in ts2.</b> 

-  If the date is found, its value is kept; otherwise the value is <b> NaN </b>

</div>

In [28]:
print("ts2 values are")
print(ts2)
ts2.reindex(ts.index, method='ffill')

ts2 values are
2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-06   -1.214498
dtype: float64


2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-05    1.744597
2023-01-06   -1.214498
Freq: D, dtype: float64

<div class='alert alert-block alert-info'>
<b>Forward filling:</b>

-   method='ffill' specifies that dates missing from ts2 (which would otherwise be NaN) are filled with the last available value before the gap.
 
 - - 2023-01-05 is not present in ts2, so it takes the last available value before the gap, which is the value of 2023-01-04
 </div>

In [29]:
ts2.reindex(ts.index, method='bfill')

2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-05   -1.214498
2023-01-06   -1.214498
Freq: D, dtype: float64

In [30]:
ts2.reindex(ts.index, method='nearest')

2023-01-03   -0.668364
2023-01-04    1.744597
2023-01-05   -1.214498
2023-01-06   -1.214498
Freq: D, dtype: float64

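<div class='alert alert-block alert-success'>
<b>method='nearest'</b>  specifies that missing values (NaNs) should be filled by using the value closest in time (either before or after the gap). Here 2023-01-05 is equally close to 2023-01-04 and 2023-01-06, and the output shows that the later value was used. </div>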

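<div class='alert alert-block alert-warning'>
<b>Not sure which one to use?</b> ffill carries the last known value forward, bfill pulls the next known value backward, and nearest takes the value at the closest existing label. </div>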
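
The three fill methods are easier to tell apart on a gap of two days, where no date is equally close to both neighbours. The sketch below uses fixed, made-up values (full, sparse and comparison are hypothetical names) and puts the three results in one DataFrame.

```python
import pandas as pd

full_index = pd.date_range('2023-01-01', periods=5)
full = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=full_index)

# Keep days 1, 2 and 5, leaving a two-day gap (days 3 and 4).
sparse = full.iloc[[0, 1, 4]]

# One column per fill method, all reindexed to the full five-day index.
comparison = pd.DataFrame({
    'ffill': sparse.reindex(full_index, method='ffill'),
    'bfill': sparse.reindex(full_index, method='bfill'),
    'nearest': sparse.reindex(full_index, method='nearest'),
})
print(comparison)
```

For day 3 the closest known date is day 2, and for day 4 it is day 5, so the nearest column differs from both ffill and bfill.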

<span style='color:orange;font-size:30px'> Dropping labels from an axis
 </span>


 <b> A method closely related to reindex is the drop() function. It removes a set of labels from an axis: </b>


In [31]:
df = pd.DataFrame({
    'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])
    })
df

Unnamed: 0,one,two,three
a,-0.04953,-0.966456,
b,2.095898,0.78208,1.905351
c,-0.386739,0.948849,0.602617
d,,3.062088,0.82885


In [32]:
df.drop(['a', 'd'], axis=0)

Unnamed: 0,one,two,three
b,2.095898,0.78208,1.905351
c,-0.386739,0.948849,0.602617


In [33]:
df.drop(['one'], axis=1)

Unnamed: 0,two,three
a,-0.966456,
b,0.78208,1.905351
c,0.948849,0.602617
d,3.062088,0.82885


<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>

 <b>drop()</b> 
- Removes the given labels from the chosen axis <b>(axis=0 for rows, axis=1 for columns)</b>: rows 'a' and 'd' in the first call, column 'one' in the second. It returns a new DataFrame.</div>
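
The same two calls can also be written with the index and columns keywords of drop, which some readers find clearer than axis numbers. The sketch below uses a small frame with made-up integer values (frame is a hypothetical name).

```python
import pandas as pd

frame = pd.DataFrame(
    {'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40], 'three': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

rows_dropped = frame.drop(index=['a', 'd'])   # same as drop(['a', 'd'], axis=0)
cols_dropped = frame.drop(columns=['one'])    # same as drop(['one'], axis=1)

# drop returns a new object; the original frame is unchanged.
print(frame.shape, rows_dropped.shape, cols_dropped.shape)
```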

<span style='color:pink; font-size:25px;'> Renaming / mapping labels </span>


In [34]:
series_var

a    0.276017
b    0.263798
c   -0.125714
d   -0.178770
e   -0.032935
dtype: float64

In [35]:
series_var.rename(str.upper)

A    0.276017
B    0.263798
C   -0.125714
D   -0.178770
E   -0.032935
dtype: float64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b>series_var.rename(str.upper)</b>

-  Renames the index labels of the pandas Series series_var by applying the str.upper function to each label, and returns a new Series.</div>

<!-- - #### If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values).
- #####  <b>But if you pass a dict or Series, it need only contain a subset of the labels as keys:</b>
 -->


In [36]:
df

Unnamed: 0,one,two,three
a,-0.04953,-0.966456,
b,2.095898,0.78208,1.905351
c,-0.386739,0.948849,0.602617
d,,3.062088,0.82885


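<div class='alert alert-block alert-warning'>
<b>Why do some cells have NaN as values?</b> The DataFrame uses the union of the three Series indexes as its rows. Column 'one' has no value for label 'd' and column 'three' has none for label 'a', so those cells are NaN. </div>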

In [37]:
df.rename(columns={'one' : 'foo', 'two' : 'bar'},
  index={'a' : 'apple', 'b' : 'banana', 'd' : 'Orange','c':'cup cake'})

  

Unnamed: 0,foo,bar,three
apple,-0.04953,-0.966456,
banana,2.095898,0.78208,1.905351
cup cake,-0.386739,0.948849,0.602617
Orange,,3.062088,0.82885


<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b>Rename</b> 

- Renames rows and columns by mapping each old row or column label to its new label

- Row labels are passed through the index argument 
</div>
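
A mapping does not have to cover every label, and a function can be used in place of a dict. The sketch below shows both on a small frame with made-up values (frame is a hypothetical name).

```python
import pandas as pd

frame = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

# A dict may cover only some labels; 'two' and 'b' are left as they are.
partly_renamed = frame.rename(columns={'one': 'foo'}, index={'a': 'apple'})

# A function is applied to every label on the chosen axis.
upper_columns = frame.rename(columns=str.upper)

print(partly_renamed)
print(upper_columns)
```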

- <span style='color:pink;font-size:30px'>Sorting by index and value </span>

- <span style='color:pink;font-size:30px'>Smallest / largest values</span>
- <span style='color:pink;font-size:30px'>Sorting by a multi-index column</span>




<span style='color:orange;font-size:30px'> There are two obvious kinds of sorting that you may be interested in </span>

- 1 <b> sorting by label </b>
 
- 2 <b> sorting by actual values </b>
 <b style='color:pink;' >  
 
The primary method for sorting axis labels (indexes) across data structures is the sort_index() method.
</b>
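
As a compact preview of sort_index, the sketch below sorts a small frame with made-up values (frame is a hypothetical name) by row label, by row label in descending order, and by column label.

```python
import pandas as pd

frame = pd.DataFrame(
    {'three': [1, 2, 3], 'one': [4, 5, 6]},
    index=['c', 'a', 'b'],
)

by_row_label = frame.sort_index()                       # rows ordered a, b, c
by_row_label_desc = frame.sort_index(ascending=False)   # rows ordered c, b, a
by_column_label = frame.sort_index(axis=1)              # columns ordered one, three

print(by_row_label)
print(by_column_label)
```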



In [38]:
df

Unnamed: 0,one,two,three
a,-0.04953,-0.966456,
b,2.095898,0.78208,1.905351
c,-0.386739,0.948849,0.602617
d,,3.062088,0.82885


In [39]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
columns=['three', 'two', 'one'])
unsorted_df

Unnamed: 0,three,two,one
a,,-0.966456,-0.04953
d,0.82885,3.062088,
c,0.602617,0.948849,-0.386739
b,1.905351,0.78208,2.095898


<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b></b>reindex rearranges the rows and columns into the requested order, and every value stays attached to its original row and column label </div>

In [40]:
unsorted_df.sort_index()

Unnamed: 0,three,two,one
a,,-0.966456,-0.04953
b,1.905351,0.78208,2.095898
c,0.602617,0.948849,-0.386739
d,0.82885,3.062088,


<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

- -  Sorted Rows

In [41]:
unsorted_df.sort_index(ascending=False) #ascending is false so it starts from last index 'd' and stop at first index 'a'

Unnamed: 0,three,two,one
d,0.82885,3.062088,
c,0.602617,0.948849,-0.386739
b,1.905351,0.78208,2.095898
a,,-0.966456,-0.04953


In [42]:
unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,-0.04953,,-0.966456
d,,0.82885,3.062088
c,-0.386739,0.602617,0.948849
b,2.095898,1.905351,0.78208


<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

- -  Sorted Columns

In [43]:
df1 = pd.DataFrame({'one':[1,2,3],'two':[11,22,33],'three':[111,222,333]})
df1

Unnamed: 0,one,two,three
0,1,11,111
1,2,22,222
2,3,33,333


In [44]:

# Sort the DataFrame by the 'two' column
df1_sorted = df1.sort_values(by='two')

df1_sorted

Unnamed: 0,one,two,three
0,1,11,111
1,2,22,222
2,3,33,333


In [45]:
df1

Unnamed: 0,one,two,three
0,1,11,111
1,2,22,222
2,3,33,333


In [46]:
df1[['one', 'two', 'three']].sort_values(by=['one','two'])

Unnamed: 0,one,two,three
0,1,11,111
1,2,22,222
2,3,33,333


 <b style='color:red; font-size:20px' > Smallest / largest values </b>

In [47]:
s = pd.Series(np.random.permutation(10))

s


0    1
1    3
2    5
3    7
4    8
5    4
6    2
7    6
8    0
9    9
dtype: int64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b> Generate a random permutation of the integers from 0 to 9 (10 exclusive) using NumPy's permutation function.</b> </div>

<div class='alert alert-block alert-info'>
<b>The resulting Series s will have random order of integers from 0 to 9. Keep in mind that the exact values will vary each time you run this code due to the randomness introduced by np.random.permutation.</b> </div>

In [48]:
s1=s.sort_values(ascending=True)
s1


8    0
0    1
6    2
1    3
5    4
2    5
7    6
3    7
4    8
9    9
dtype: int64

<h1 style='color: black; background-color:pink ;padding:5px'>Observation</h1>

<div class='alert alert-block alert-info'>
<b>Values are listed in ascending order; each value keeps its original index label</b> </div>
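
When only the few smallest or largest values are needed, a full sort is not required. The sketch below assumes the Series methods nsmallest and nlargest are what this heading points to, and reuses the permutation printed above as fixed data (numbers is a hypothetical name).

```python
import pandas as pd

# Same values as the permutation shown above, written out so the result is reproducible.
numbers = pd.Series([1, 3, 5, 7, 8, 4, 2, 6, 0, 9])

three_smallest = numbers.nsmallest(3)   # the 3 smallest values, smallest first
three_largest = numbers.nlargest(3)     # the 3 largest values, largest first

print(three_smallest)
print(three_largest)
```

As with sort_values, each returned value keeps the index label it had in the original Series.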