# Essential basic functionality

In [10]:
import pandas as pd
import numpy as np

In [4]:
Index = pd.date_range("05/09/2024")

ValueError: Of the four parameters: start, end, periods, and freq, exactly three must be specified

In [7]:
Index = pd.date_range("05/09/2024", periods=8)

In [8]:
Index

DatetimeIndex(['2024-05-09', '2024-05-10', '2024-05-11', '2024-05-12',
               '2024-05-13', '2024-05-14', '2024-05-15', '2024-05-16'],
              dtype='datetime64[ns]', freq='D')

In [12]:
s = pd.Series(np.random(5), index=["a","b","c","d","e"])

TypeError: 'module' object is not callable

In [15]:
s = pd.Series(np.random.randn(5), index=["a","b","c","d","e"])

In [16]:
s

a    0.034311
b   -0.127555
c    1.379252
d   -1.235054
e   -0.184127
dtype: float64

In [19]:
df = pd.DataFrame(np.random.randn(8,3), index=Index, columns=["A","B","C"])

In [20]:
df

                   A         B         C
2024-05-09  1.194649 -1.022221  0.287236
2024-05-10 -1.421366 -0.767551  2.272302
2024-05-11  0.538915  0.706642  0.908632
2024-05-12 -1.006558 -0.154825 -0.755527
2024-05-13  0.826105 -0.388073  0.361420
2024-05-14  0.080057  1.409322  0.667180
2024-05-15 -2.494931 -0.752174 -0.087260
2024-05-16 -2.890577  0.347505 -0.814468


# Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [22]:
series = pd.Series(np.random.randn(1000))

In [23]:
series

0     -0.590928
1     -2.801617
2     -0.037419
3      0.383765
4      0.185884
         ...   
995    0.781974
996    0.020362
997   -0.035957
998    1.116456
999    1.266377
Length: 1000, dtype: float64

In [24]:
series.head()

0   -0.590928
1   -2.801617
2   -0.037419
3    0.383765
4    0.185884
dtype: float64

In [25]:
series.tail()

995    0.781974
996    0.020362
997   -0.035957
998    1.116456
999    1.266377
dtype: float64
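The custom element count mentioned above can be sketched as follows. The `demo` Series and its values are hypothetical, chosen only so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a Series holding the integers 0..9.
demo = pd.Series(np.arange(10))

first_three = demo.head(3)  # first 3 elements instead of the default 5
last_two = demo.tail(2)     # last 2 elements instead of the default 5

print(first_three.tolist())  # [0, 1, 2]
print(last_two.tolist())     # [8, 9]
```

The same `head(n)` and `tail(n)` calls work on a DataFrame, where `n` counts rows.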

# Attributes and underlying data

pandas objects have a number of attributes that let you access their metadata:

shape: gives the axis dimensions of the object, consistent with ndarray

Axis labels
    Series: index (only axis)

    DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!



In [28]:
df[:2]

                   A         B         C
2024-05-09  1.194649 -1.022221  0.287236
2024-05-10 -1.421366 -0.767551  2.272302


In [29]:
df.columns = [x.lower() for x in df.columns]

In [30]:
df

                   a         b         c
2024-05-09  1.194649 -1.022221  0.287236
2024-05-10 -1.421366 -0.767551  2.272302
2024-05-11  0.538915  0.706642  0.908632
2024-05-12 -1.006558 -0.154825 -0.755527
2024-05-13  0.826105 -0.388073  0.361420
2024-05-14  0.080057  1.409322  0.667180
2024-05-15 -2.494931 -0.752174 -0.087260
2024-05-16 -2.890577  0.347505 -0.814468
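As a small sketch of the attributes listed in this section, the block below reads `shape` on a DataFrame and on a Series, then assigns new labels to both axes. The `frame` object and the labels `x`, `y`, `z` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 8 x 3 frame filled with zeros.
frame = pd.DataFrame(np.zeros((8, 3)), columns=["A", "B", "C"])
col = frame["A"]

print(frame.shape)  # (8, 3): rows, columns
print(col.shape)    # (8,): a Series has a single axis

# Both axis label attributes can be assigned to directly.
frame.index = pd.date_range("2024-05-09", periods=8)
frame.columns = ["x", "y", "z"]

print(list(frame.columns))  # ['x', 'y', 'z']
```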


pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray.

To get the actual data inside an Index or Series, use the .array property.

In [31]:
s.array

<NumpyExtensionArray>
[0.034311237914647984, -0.12755500913328432,   1.3792517632603642,
  -1.2350542291825803, -0.18412739736155095]
Length: 5, dtype: float64

In [32]:
s.index.array

<NumpyExtensionArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See dtypes for more.
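To illustrate that `.array` returns an ExtensionArray regardless of the dtype, here is a minimal sketch using three hypothetical Series. The concrete class names printed depend on the installed pandas version, so only the `isinstance` check is relied upon.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Hypothetical Series with a NumPy-backed dtype and two extension dtypes.
float_ser = pd.Series([0.5, 1.5])
nullable_ser = pd.Series([1, 2, None], dtype="Int64")
cat_ser = pd.Series(["a", "b", "a"], dtype="category")

checks = []
for example in (float_ser, nullable_ser, cat_ser):
    arr = example.array
    # Print the concrete array class and whether it is an ExtensionArray.
    print(type(arr).__name__, isinstance(arr, ExtensionArray))
    checks.append(isinstance(arr, ExtensionArray))
```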

In [33]:
s.to_numpy()

array([ 0.03431124, -0.12755501,  1.37925176, -1.23505423, -0.1841274 ])

In [34]:
np.array

<function numpy.array>

In [39]:
np.asarray(s)

array([ 0.03431124, -0.12755501,  1.37925176, -1.23505423, -0.1841274 ])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values.
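A minimal sketch of such a coercion, assuming a hypothetical nullable-integer Series: converting it to a float NumPy array requires choosing how the missing value is represented, which is done here with the `dtype` and `na_value` arguments of `to_numpy()`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series backed by a nullable integer ExtensionArray.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Request a float64 result and represent the missing value as NaN.
# The data is copied and the integers are coerced to floats.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)

print(as_float.dtype)  # float64
print(as_float)
```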

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. 

NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz

A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones may be preserved with dtype=object



In [41]:
ser = pd.Series(pd.date_range("2024", periods=2, tz="CET"))

In [42]:
ser

0   2024-01-01 00:00:00+01:00
1   2024-01-02 00:00:00+01:00
dtype: datetime64[ns, CET]

In [43]:
ser.to_numpy(dtype=object)

array([Timestamp('2024-01-01 00:00:00+0100', tz='CET'),
       Timestamp('2024-01-02 00:00:00+0100', tz='CET')], dtype=object)

Or thrown away with dtype='datetime64[ns]'

In [47]:
ser.to_numpy(dtype="datetime64[ns]")

array(['2023-12-31T23:00:00.000000000', '2024-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [48]:
df.to_numpy()

array([[ 1.19464939, -1.0222209 ,  0.28723636],
       [-1.42136617, -0.76755113,  2.2723024 ],
       [ 0.53891496,  0.70664205,  0.90863195],
       [-1.00655806, -0.15482476, -0.75552692],
       [ 0.82610519, -0.38807318,  0.36142005],
       [ 0.08005725,  1.40932207,  0.66718026],
       [-2.49493076, -0.75217442, -0.08726001],
       [-2.89057721,  0.34750474, -0.81446806]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. It is now recommended to avoid .values and to use .array or .to_numpy() instead.

# Drawbacks of .values

When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the extension array. Series.array always returns an ExtensionArray and never copies the data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying or coercing values.

When a DataFrame contains a mixture of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
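The coercion to a common type can be observed through the dtype of the returned array. The three small DataFrames below are hypothetical and exist only to contrast the homogeneous and mixed cases.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one homogeneous, two with mixed column dtypes.
homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed_numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed_object = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

homogeneous_dtype = homogeneous.to_numpy().dtype
mixed_numeric_dtype = mixed_numeric.to_numpy().dtype
mixed_object_dtype = mixed_object.to_numpy().dtype

print(homogeneous_dtype)    # float64, no coercion needed
print(mixed_numeric_dtype)  # float64, the integer column is upcast
print(mixed_object_dtype)   # object, the only dtype that holds both
```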

# Accelerated operations


pandas supports accelerating certain types of binary numerical and boolean operations using the <b>numexpr</b> and <b>bottleneck</b> libraries.

<b>numexpr</b> uses smart chunking, caching, and multiple cores.

<b>bottleneck</b> is a set of specialized cython routines that are especially fast when dealing with arrays that have nans.


Here is a sample (using 100 column x 100,000 row DataFrames):

| Operation | 0.11.0 (ms) | Prior Version (ms) | Ratio to Prior |
|-----------|-------------|--------------------|----------------|
| df1 > df2 | 13.32       | 125.35             | 0.1063         |
| df1 * df2 | 21.71       | 36.63              | 0.5928         |
| df1 + df2 | 22.04       | 36.50              | 0.6039         |

Both are enabled by default. You can control this by setting the following options:

In [53]:
bottleneckf = pd.set_option("compute.use_bottleneck", False)

In [56]:
print(bottleneckf)

None


In [59]:
bottleneckt = pd.set_option("compute.use_bottleneck", True)

In [60]:
print(bottleneckt)

None


In [63]:
numexprf = pd.set_option("compute.use_numexpr",False)

In [64]:
print(numexprf)

None


In [65]:
numexprt = pd.set_option("compute.use_numexpr",True)

In [66]:
print(numexprt)

None
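The cells above print None because pd.set_option() changes the option in place and returns nothing. A minimal sketch of reading the current value back with pd.get_option(), and of changing an option only temporarily with pd.option_context(), might look like this. The variable names are hypothetical.

```python
import pandas as pd

# Remember the current setting so it can be restored at the end.
original = pd.get_option("compute.use_numexpr")

# set_option mutates global state and returns None.
result = pd.set_option("compute.use_numexpr", False)
print(result)                                # None
print(pd.get_option("compute.use_numexpr"))  # False

# option_context applies a value only inside the with-block.
with pd.option_context("compute.use_numexpr", True):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(inside, after)  # True False

# Restore the setting that was active before this sketch ran.
pd.set_option("compute.use_numexpr", original)
```

The same pattern applies to "compute.use_bottleneck".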
