# Essential basic functionality

In [5]:
import pandas as pd
import numpy as np

In [207]:
Index = pd.date_range("05/09/2024" , periods=8)

In [209]:
Index

DatetimeIndex(['2024-05-09', '2024-05-10', '2024-05-11', '2024-05-12',
               '2024-05-13', '2024-05-14', '2024-05-15', '2024-05-16'],
              dtype='datetime64[ns]', freq='D')

In [211]:
s = pd.Series(np.random(5), index=["a","b","c","d","e"])

TypeError: 'module' object is not callable

In [213]:
s = pd.Series(np.random.randn(5), index=["a","b","c","d","e"])

In [215]:
s

a   -0.278786
b    0.707601
c   -1.262127
d   -0.825999
e    0.502408
dtype: float64

In [217]:
df = pd.DataFrame(np.random.randn(8,3), index=Index, columns=["A","B","C"])

In [219]:
df

                   A         B         C
2024-05-09 -0.669373 -0.897791 -2.039007
2024-05-10  0.474056 -2.301712  0.012047
2024-05-11 -0.651026 -0.109305  0.475347
2024-05-12  1.079372  0.858243  1.103918
2024-05-13  0.367591  1.496114 -0.591148
2024-05-14  0.222514 -0.772699  0.642954
2024-05-15  0.495095 -1.283508 -0.112395
2024-05-16  2.551928  1.247187  1.468325


# Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [222]:
series = pd.Series(np.random.randn(1000))

In [224]:
series

0     -0.603932
1      0.825112
2      1.703419
3      0.358948
4      0.413435
         ...   
995    1.972014
996   -0.300611
997    1.290141
998   -0.100424
999   -0.343887
Length: 1000, dtype: float64

In [226]:
series.head()

0   -0.603932
1    0.825112
2    1.703419
3    0.358948
4    0.413435
dtype: float64

In [228]:
series.tail()

995    1.972014
996   -0.300611
997    1.290141
998   -0.100424
999   -0.343887
dtype: float64
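
As a minimal sketch of passing a custom number to head() and tail(), the example below uses a deterministic series (an assumption made here so the values are predictable, in place of the random data above):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random series used above.
series = pd.Series(np.arange(1000))

first_three = series.head(3)   # first 3 elements instead of the default 5
last_two = series.tail(2)      # last 2 elements instead of the default 5

print(first_three)
print(last_two)
```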

# Attributes and underlying data

pandas objects have a number of attributes that enable you to access the metadata:

shape: gives the axis dimensions of the object, consistent with ndarray

Axis labels
    Series: index (only axis)

    DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!



In [230]:
df[:2]

                   A         B         C
2024-05-09 -0.669373 -0.897791 -2.039007
2024-05-10  0.474056 -2.301712  0.012047


In [232]:
df.columns = [x.lower() for x in df.columns]

In [234]:
df

                   a         b         c
2024-05-09 -0.669373 -0.897791 -2.039007
2024-05-10  0.474056 -2.301712  0.012047
2024-05-11 -0.651026 -0.109305  0.475347
2024-05-12  1.079372  0.858243  1.103918
2024-05-13  0.367591  1.496114 -0.591148
2024-05-14  0.222514 -0.772699  0.642954
2024-05-15  0.495095 -1.283508 -0.112395
2024-05-16  2.551928  1.247187  1.468325
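
The attributes listed in this section can be inspected and reassigned directly. The following is a small sketch on a hypothetical frame of zeros (not the df used above), showing shape, index and columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame(np.zeros((8, 3)), columns=["A", "B", "C"])

print(frame.shape)     # axis dimensions as (rows, columns)
print(frame.index)     # row labels
print(frame.columns)   # column labels

# Axis labels can be reassigned, provided the new length matches the axis.
frame.index = pd.date_range("2024-05-09", periods=8)
frame.columns = [name.lower() for name in frame.columns]
```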


pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray.

To get the actual data inside an Index or Series, use the .array property.

In [236]:
s.array

<NumpyExtensionArray>
[-0.2787855784414722,  0.7076012105159336, -1.2621274011981924,
 -0.8259987993942098,  0.5024081835470091]
Length: 5, dtype: float64

In [238]:
s.index.array

<NumpyExtensionArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.

In [240]:
s.to_numpy()

array([-0.27878558,  0.70760121, -1.2621274 , -0.8259988 ,  0.50240818])

In [242]:
np.array

<function numpy.array>

In [244]:
np.asarray(s)

array([-0.27878558,  0.70760121, -1.2621274 , -0.8259988 ,  0.50240818])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones.

NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz.

A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded.

Timezones may be preserved with dtype=object:



In [246]:
ser = pd.Series(pd.date_range("2024", periods=2, tz="CET" ))

In [248]:
ser

0   2024-01-01 00:00:00+01:00
1   2024-01-02 00:00:00+01:00
dtype: datetime64[ns, CET]

In [250]:
ser.to_numpy(dtype=object)

array([Timestamp('2024-01-01 00:00:00+0100', tz='CET'),
       Timestamp('2024-01-02 00:00:00+0100', tz='CET')], dtype=object)

Or thrown away with dtype='datetime64[ns]':

In [252]:
ser.to_numpy(dtype = "datetime64[ns]")

array(['2023-12-31T23:00:00.000000000', '2024-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [254]:
df.to_numpy()

array([[-0.66937338, -0.89779063, -2.03900744],
       [ 0.4740555 , -2.30171153,  0.01204717],
       [-0.65102576, -0.10930545,  0.47534673],
       [ 1.07937209,  0.85824326,  1.10391811],
       [ 0.36759109,  1.49611382, -0.59114759],
       [ 0.22251376, -0.77269936,  0.64295446],
       [ 0.49509489, -1.28350795, -0.11239501],
       [ 2.55192804,  1.24718716,  1.46832454]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

In the past, pandas recommended Series.values and DataFrame.values for extracting the data from a Series or DataFrame. It is now recommended to avoid .values and to use .array or .to_numpy() instead.

# Drawbacks of .values

When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the extension array. Series.array always returns an ExtensionArray and never copies the data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying or coercing values.

When a DataFrame contains a mixture of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
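
The difference between .array and .to_numpy() is easiest to see with an extension type. The sketch below assumes a nullable-integer Series and a small mixed-dtype DataFrame; both are made up for illustration:

```python
import numpy as np
import pandas as pd

# A Series backed by an extension type (nullable integers).
ser_ext = pd.Series([1, 2, None], dtype="Int64")

ext_arr = ser_ext.array   # a pandas ExtensionArray
# Converting to NumPy requires choosing a dtype that can hold the missing value.
np_arr = ser_ext.to_numpy(dtype="float64", na_value=np.nan)

# A DataFrame with mixed dtypes is coerced to a common dtype on conversion.
mixed = pd.DataFrame({"num": [1, 2], "txt": ["x", "y"]})
mixed_arr = mixed.to_numpy()

print(type(ext_arr).__name__, np_arr.dtype, mixed_arr.dtype)
```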

# Accelerated operations


pandas supports accelerating certain types of binary numerical and boolean operations using the <b>numexpr</b> and <b>bottleneck</b> libraries.

<b>numexpr</b> uses smart chunking, caching, and multiple cores.

<b>bottleneck</b> is a set of specialized Cython routines that are especially fast when dealing with arrays that have NaNs.


Here is a sample (using 100 column x 100,000 row DataFrames):

| Operation | 0.11.0 (ms) | Prior Version (ms) | Ratio to Prior |
|---|---|---|---|
| df1 > df2 | 13.32 | 125.35 | 0.1063 |
| df1 * df2 | 21.71 | 36.63 | 0.5928 |
| df1 + df2 | 22.04 | 36.50 | 0.6039 |

Both are enabled by default. You can control this by setting the options below. Note that pd.set_option() returns None, which is why printing its result shows None.

In [256]:
bottleneckf = pd.set_option("compute.use_bottleneck", False)

In [258]:
print(bottleneckf)

None


In [260]:
bottleneckt = pd.set_option("compute.use_bottleneck", True)

In [262]:
print(bottleneckt)

None


In [264]:
numexprf = pd.set_option("compute.use_numexpr",False)

In [266]:
print(numexprf)

None


In [268]:
numexprt = pd.set_option("compute.use_numexpr",True)

In [270]:
print(numexprt)

None
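
Since set_option returns None, the current value has to be read back separately. A minimal sketch using pd.get_option and pd.option_context (both part of the pandas options API) might look like this:

```python
import pandas as pd

# Read the value back with get_option after changing it.
pd.set_option("compute.use_numexpr", False)
numexpr_off = pd.get_option("compute.use_numexpr")

# option_context changes an option only inside the with-block.
with pd.option_context("compute.use_bottleneck", False):
    inside = pd.get_option("compute.use_bottleneck")

# Re-enable numexpr, matching the last cell above.
pd.set_option("compute.use_numexpr", True)
print(numexpr_off, inside, pd.get_option("compute.use_numexpr"))
```

The context-manager form is convenient for timing comparisons, because the option is restored automatically when the block exits.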


# Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:

<b>Broadcasting behavior</b> between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.

<b>Missing data in computations</b>.

We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

# Matching / broadcasting behavior


DataFrame has the methods <b>add(), sub(), mul(), div()</b> and the related functions <b>radd(), rsub(), …</b> for carrying out binary operations.


For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the <b>axis</b> keyword:

In [7]:
df = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=["a","b","c"]),
    "two": pd.Series(np.random.randn(4), index=["a","b","c","d"]),
    "three": pd.Series(np.random.randn(3), index=["b","c","d"])
    
    
})

In [9]:
df

        one       two     three
a  0.197182 -0.097513       NaN
b -0.969735  1.171266  0.230784
c  0.202477  1.000609  1.324386
d       NaN -0.434550  0.241848


In [13]:
row =  df.iloc[1]

In [15]:
row

one     -0.969735
two      1.171266
three    0.230784
Name: b, dtype: float64

In [17]:
column= df["two"]

In [19]:
column

a   -0.097513
b    1.171266
c    1.000609
d   -0.434550
Name: two, dtype: float64

In [29]:
df.sub(row, axis="columns")

        one       two     three
a  1.166917 -1.268779       NaN
b  0.000000  0.000000  0.000000
c  1.172212 -0.170657  1.093601
d       NaN -1.605816  0.011064


In [31]:
df.sub(row, axis=1)

        one       two     three
a  1.166917 -1.268779       NaN
b  0.000000  0.000000  0.000000
c  1.172212 -0.170657  1.093601
d       NaN -1.605816  0.011064


In [33]:
df.sub(column, axis="index")

        one  two     three
a  0.294696  0.0       NaN
b -2.141001  0.0 -0.940481
c -0.798131  0.0  0.323777
d       NaN  0.0  0.676399


In [35]:
df.sub(column, axis=0)

        one  two     three
a  0.294696  0.0       NaN
b -2.141001  0.0 -0.940481
c -0.798131  0.0  0.323777
d       NaN  0.0  0.676399
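
The effect of the axis keyword is easier to verify with round numbers. The following sketch uses a small hypothetical frame (not the random df above) and checks that matching on columns is the same as the plain `-` operator:

```python
import pandas as pd

# Hypothetical frame with easy-to-check values.
frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
row = frame.loc["b"]      # Series indexed by column labels
column = frame["two"]     # Series indexed by row labels

# Match the Series against the column labels: row "b" is subtracted from every row.
by_columns = frame.sub(row, axis="columns")
# Match the Series against the row labels: "two" is subtracted from every column.
by_index = frame.sub(column, axis="index")

print(by_columns)
print(by_index)
print(by_columns.equals(frame - row))
```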


Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.

In [42]:
dfmi = df.copy()

In [48]:
dfmi.index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"]
)

In [50]:
dfmi.sub(column, axis=0, level="second")

                   one       two     three
first second
1     a       0.294696  0.000000       NaN
      b      -2.141001  0.000000 -0.940481
      c      -0.798131  0.000000  0.323777
2     a            NaN -0.337037  0.339362


Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side.

In [59]:
s = pd.Series(np.arange(10))

In [61]:
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

In [65]:
div, rem= divmod(s,3)

In [67]:
div

0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int32

In [69]:
rem

0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int32

In [75]:
idx= pd.Index(np.arange(10))

idx

In [79]:
dev ,rem = divmod(idx ,3)

In [81]:
dev

Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int32')

In [83]:
rem

Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int32')

In [89]:
div, rem = divmod(s ,[2,2,3,3,4,4,5,5,6,6])

In [91]:
div

0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int32

In [93]:
rem

0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int32
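
As a quick check on what divmod() returns, the sketch below compares it with floor division and modulo computed separately, and rebuilds the original values from the two parts:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))
div, rem = divmod(s, 3)

# divmod matches floor division and modulo computed separately.
same_div = div.equals(s // 3)
same_rem = rem.equals(s % 3)

# The original values can be rebuilt from quotient and remainder.
rebuilt = div * 3 + rem
print(same_div, same_rem, rebuilt.equals(s))
```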

# Missing data / operations with fill values


In Series and DataFrame, the arithmetic functions accept an optional fill_value, namely a value to substitute when at most one of the values at a location is missing.


For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

In [97]:
df2 = df.copy()

In [99]:
df

        one       two     three
a  0.197182 -0.097513       NaN
b -0.969735  1.171266  0.230784
c  0.202477  1.000609  1.324386
d       NaN -0.434550  0.241848


In [101]:
df2.loc["a", "three"]= 1.0

In [103]:
df2

        one       two     three
a  0.197182 -0.097513  1.000000
b -0.969735  1.171266  0.230784
c  0.202477  1.000609  1.324386
d       NaN -0.434550  0.241848


In [105]:
df

        one       two     three
a  0.197182 -0.097513       NaN
b -0.969735  1.171266  0.230784
c  0.202477  1.000609  1.324386
d       NaN -0.434550  0.241848


In [107]:
df+df2

        one       two     three
a  0.394365 -0.195027       NaN
b -1.939470  2.342532  0.461569
c  0.404955  2.001217  2.648771
d       NaN -0.869101  0.483697


In [109]:
df.add(df2, fill_value=0)

        one       two     three
a  0.394365 -0.195027  1.000000
b -1.939470  2.342532  0.461569
c  0.404955  2.001217  2.648771
d       NaN -0.869101  0.483697
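
The three possible cases (neither value missing, one missing, both missing) can be seen side by side in a small sketch. The frames `left` and `right` are hypothetical and exist only for this illustration:

```python
import numpy as np
import pandas as pd

# Row 0: both present. Row 1: only right present. Row 2: both missing.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                     # NaN wherever either side is missing
filled = left.add(right, fill_value=0)   # NaN only where both sides are missing

print(plain)
print(filled)
```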


# Flexible comparisons

Series and DataFrame have the binary comparison methods <b>eq, ne, lt, gt, le, and ge</b>:

In [113]:
df.gt(df2)

     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False


In [115]:
df2.ne(df)

     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False


These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool. These boolean objects can be used in indexing operations.
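
As a minimal sketch of using a comparison result for indexing, the example below builds a boolean mask with gt() and selects the matching rows. The Series `scores` is a made-up example:

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.Series([3, 8, 5, 10], index=["a", "b", "c", "d"])

mask = scores.gt(4)        # boolean Series with the same index as scores
selected = scores[mask]    # keeps only the rows where the mask is True

print(mask)
print(selected)
```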

# Boolean reductions

To summarize a boolean result, you can apply the reductions empty, any(), all(), and bool().

In [122]:
(df > 0).all()

one      False
two      False
three    False
dtype: bool

In [124]:
(df > 0 ).any()

one      True
two      True
three    True
dtype: bool

You can reduce to a final boolean value.

In [127]:
(df > 0).any().any()

True

You can test if a pandas object is empty via the empty property.

In [130]:
df.empty

False

In [134]:
pd.DataFrame(columns=list("ABC")).empty

True
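
Because the df above holds random values, the reductions are easier to follow on fixed data. The sketch below uses a small hypothetical frame and also shows the row-wise variant via axis=1:

```python
import pandas as pd

# Hypothetical frame with known signs.
frame = pd.DataFrame({"p": [1, 2], "q": [-1, 3]})
positive = frame > 0

per_column = positive.all()        # one boolean per column
per_row = positive.all(axis=1)     # one boolean per row
anywhere = positive.any().any()    # a single boolean for the whole frame

print(per_column)
print(per_row)
print(anywhere)
```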

# Warning
    Asserting the truthiness of a pandas object will raise an error, as the testing of the emptiness or values is ambiguous.

In [139]:
if df:
    print(True)

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [141]:
df and df2

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
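
One way to avoid this error is to state explicitly which question is being asked. The following sketch, on a hypothetical frame, contrasts the ambiguous truth test with explicit checks for emptiness and for values:

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({"v": [0, 1, 2]})

try:
    bool(frame)            # the same check that `if frame:` performs
    raised = False
except ValueError:
    raised = True

has_rows = not frame.empty                      # is there any data at all?
any_nonzero = bool((frame != 0).any().any())    # is any value non-zero?
all_nonzero = bool((frame != 0).all().all())    # are all values non-zero?

print(raised, has_rows, any_nonzero, all_nonzero)
```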

# Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result.

For example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression is False:
    

In [146]:
df+df==df*2

     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True


In [148]:
(df+df==df*2).all()

one      False
two       True
three    False
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do not compare as equals:

In [151]:
np.nan == np.nan

False

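For this reason, NDFrames (such as Series and DataFrame) have an <b>equals()</b> method for testing equality, with NaNs in corresponding locations treated as equal. Note that the cell below compares against df2, which differs from df at ("a", "three"), so the result is False. Comparing against df * 2 instead would return True.

In [163]:
(df+df).equals(df2*2)

False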
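
To see the NaN handling in isolation, the sketch below uses a small hypothetical frame and compares the element-wise test with equals():

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in each column.
frame = pd.DataFrame({"one": [1.0, np.nan], "two": [np.nan, 4.0]})

# Element-wise comparison fails at the NaN positions, because NaN != NaN.
elementwise = bool((frame + frame == frame * 2).all().all())
# equals() treats NaNs in the same locations as equal.
whole = (frame + frame).equals(frame * 2)

print(elementwise, whole)
```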

# Note:
    For equals() to return True, the Series or DataFrame index needs to be in the same order.

In [166]:
df1= pd.DataFrame({
    "col":["foo", 0, np.nan]
})


In [168]:
df1

   col
0  foo
1    0
2  NaN


In [170]:
df2 = pd.DataFrame({
    "col":[np.nan, 0, "foo"]
}, index=[2,1, 0])

In [172]:
df2

   col
2  NaN
1    0
0  foo


In [174]:
df1.equals(df2)

False

In [176]:
df1.equals(df2.sort_index())

True