# 10 minutes to pandas

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

This is a short introduction to pandas, geared mainly for new users. 

Customarily, we import as follows:

In [1]:
import numpy as np

import pandas as pd

# Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])


In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range("20130101", periods=6)

In [5]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

In [7]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.418959,-0.780659,-0.770191,-0.644822
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-04,-2.195193,1.163928,-0.278213,0.937113
2013-01-05,-2.159187,0.15225,0.322503,-0.366304
2013-01-06,0.920606,0.712668,1.549561,0.618919


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [8]:
df2 = pd.DataFrame(
     {
         "A": 1.0,
         "B": pd.Timestamp("20130102"),
         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
         "D": np.array([3] * 4, dtype="int32"),
         "E": pd.Categorical(["test", "train", "test", "train"]),
         "F": "foo",
     }
 )

In [9]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes.

In [10]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

# Viewing data

Here is how to view the top and bottom rows of the frame:

In [11]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.418959,-0.780659,-0.770191,-0.644822
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-04,-2.195193,1.163928,-0.278213,0.937113
2013-01-05,-2.159187,0.15225,0.322503,-0.366304


In [12]:
df.tail(4)

Unnamed: 0,A,B,C,D
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-04,-2.195193,1.163928,-0.278213,0.937113
2013-01-05,-2.159187,0.15225,0.322503,-0.366304
2013-01-06,0.920606,0.712668,1.549561,0.618919


Display the index, columns:

In [13]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [14]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [15]:
df.to_numpy()

array([[-1.41895877, -0.78065909, -0.77019117, -0.64482238],
       [-0.8810729 , -2.18718664,  0.53810372, -0.45063501],
       [ 0.90365291, -0.02534881,  0.35216401,  1.77925709],
       [-2.1951932 ,  1.16392847, -0.27821278,  0.93711294],
       [-2.15918738,  0.1522504 ,  0.32250302, -0.36630384],
       [ 0.920606  ,  0.71266786,  1.54956125,  0.61891943]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [16]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

**Note:** **DataFrame.to_numpy()** does not include the index or column labels in the output.
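
The dtype behaviour described above can be checked on two small frames. This is a minimal sketch; the frames `homogeneous` and `mixed` are made up for illustration and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Two float columns: one NumPy dtype can hold everything.
homogeneous = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
# A float column next to a string column: only `object` can hold both.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

floats = homogeneous.to_numpy()
objects = mixed.to_numpy()

print(floats.dtype)   # float64
print(objects.dtype)  # object
print(floats.shape)   # (2, 2): values only, no index or column labels
```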

**describe()** shows a quick statistic summary of your data:

In [17]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.805026,-0.160725,0.285655,0.312255
std,1.417491,1.193706,0.787366,0.958648
min,-2.195193,-2.187187,-0.770191,-0.644822
25%,-1.97413,-0.591832,-0.128034,-0.429552
50%,-1.150016,0.063451,0.337334,0.126308
75%,0.457471,0.572563,0.491619,0.857565
max,0.920606,1.163928,1.549561,1.779257


Transposing your data:

In [18]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-1.418959,-0.881073,0.903653,-2.195193,-2.159187,0.920606
B,-0.780659,-2.187187,-0.025349,1.163928,0.15225,0.712668
C,-0.770191,0.538104,0.352164,-0.278213,0.322503,1.549561
D,-0.644822,-0.450635,1.779257,0.937113,-0.366304,0.618919


In [19]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
A,6.0,-0.805026,1.417491,-2.195193,-1.97413,-1.150016,0.457471,0.920606
B,6.0,-0.160725,1.193706,-2.187187,-0.591832,0.063451,0.572563,1.163928
C,6.0,0.285655,0.787366,-0.770191,-0.128034,0.337334,0.491619,1.549561
D,6.0,0.312255,0.958648,-0.644822,-0.429552,0.126308,0.857565,1.779257


Sorting by an axis:

In [20]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-0.644822,-0.770191,-0.780659,-1.418959
2013-01-02,-0.450635,0.538104,-2.187187,-0.881073
2013-01-03,1.779257,0.352164,-0.025349,0.903653
2013-01-04,0.937113,-0.278213,1.163928,-2.195193
2013-01-05,-0.366304,0.322503,0.15225,-2.159187
2013-01-06,0.618919,1.549561,0.712668,0.920606


Sorting by values:

In [21]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-01,-1.418959,-0.780659,-0.770191,-0.644822
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-05,-2.159187,0.15225,0.322503,-0.366304
2013-01-06,0.920606,0.712668,1.549561,0.618919
2013-01-04,-2.195193,1.163928,-0.278213,0.937113


# Selection:

## Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [22]:
df["A"]

2013-01-01   -1.418959
2013-01-02   -0.881073
2013-01-03    0.903653
2013-01-04   -2.195193
2013-01-05   -2.159187
2013-01-06    0.920606
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [23]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-1.418959,-0.780659,-0.770191,-0.644822
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-03,0.903653,-0.025349,0.352164,1.779257


In [24]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-04,-2.195193,1.163928,-0.278213,0.937113


## Selection by label

For getting a cross section using a label:

In [25]:
df.loc[dates[0]]

A   -1.418959
B   -0.780659
C   -0.770191
D   -0.644822
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [26]:
df.loc[:,["A","B"]]

Unnamed: 0,A,B
2013-01-01,-1.418959,-0.780659
2013-01-02,-0.881073,-2.187187
2013-01-03,0.903653,-0.025349
2013-01-04,-2.195193,1.163928
2013-01-05,-2.159187,0.15225
2013-01-06,0.920606,0.712668


Showing label slicing, both endpoints are included:

In [27]:
df.loc["20130102":"20130104", ["A","B"]]

Unnamed: 0,A,B
2013-01-02,-0.881073,-2.187187
2013-01-03,0.903653,-0.025349
2013-01-04,-2.195193,1.163928


Reduction in the dimensions of the returned object:

In [28]:
df.loc["20130102", ["A","B"]]

A   -0.881073
B   -2.187187
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

In [29]:
df.loc[dates[0], "A"]

-1.4189587739722653

For getting fast access to a scalar (equivalent to the prior method):

In [30]:
df.at[dates[0], "A"]

-1.4189587739722653

## Selection by position

Select via the position of the passed integers:

In [31]:
df.iloc[3]

A   -2.195193
B    1.163928
C   -0.278213
D    0.937113
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

In [32]:
df.iloc[3:5 , 0:2]

Unnamed: 0,A,B
2013-01-04,-2.195193,1.163928
2013-01-05,-2.159187,0.15225
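
Label slices and integer slices treat their endpoints differently, which is easy to miss. The sketch below uses a small hypothetical frame (`frame`, not the tutorial's `df`) to compare the two.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

# Label-based slice: both endpoints are included.
by_label = frame.loc["20130102":"20130104"]
# Position-based slice: the stop position is excluded, as in Python lists.
by_position = frame.iloc[1:3]

print(len(by_label))     # 3
print(len(by_position))  # 2
```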


By lists of integer position locations, similar to the NumPy/Python style:

In [33]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,-0.881073,0.538104
2013-01-03,0.903653,0.352164
2013-01-05,-2.159187,0.322503


For slicing rows explicitly:

In [34]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635
2013-01-03,0.903653,-0.025349,0.352164,1.779257


For slicing columns explicitly:

In [35]:
df.iloc[ : , 1:3 ]

Unnamed: 0,B,C
2013-01-01,-0.780659,-0.770191
2013-01-02,-2.187187,0.538104
2013-01-03,-0.025349,0.352164
2013-01-04,1.163928,-0.278213
2013-01-05,0.15225,0.322503
2013-01-06,0.712668,1.549561


For getting a value explicitly:

In [36]:
df.iloc[1,1]

-2.187186644184768

For getting fast access to a scalar (equivalent to the prior method):

In [37]:
df.iat[1,1]

-2.187186644184768

# Boolean indexing

Using a single column’s values to select data.

In [38]:
df[df["A"] > 0]

Unnamed: 0,A,B,C,D
2013-01-03,0.903653,-0.025349,0.352164,1.779257
2013-01-06,0.920606,0.712668,1.549561,0.618919


Selecting values from a DataFrame where a boolean condition is met.

In [39]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,,,
2013-01-02,,,0.538104,
2013-01-03,0.903653,,0.352164,1.779257
2013-01-04,,1.163928,,0.937113
2013-01-05,,0.15225,0.322503,
2013-01-06,0.920606,0.712668,1.549561,0.618919
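
Indexing a DataFrame with a boolean DataFrame keeps the shape and replaces the unselected cells with NaN, which is the same result `DataFrame.where()` gives. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = frame[frame > 0]            # non-positive cells become NaN
same = frame.where(frame > 0)        # equivalent spelling

print(masked.equals(same))           # True
print(int(masked.isna().sum().sum()))  # 2
```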


Using the isin() method for filtering:

In [40]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-1.418959,-0.780659,-0.770191,-0.644822,one
2013-01-02,-0.881073,-2.187187,0.538104,-0.450635,one
2013-01-03,0.903653,-0.025349,0.352164,1.779257,two
2013-01-04,-2.195193,1.163928,-0.278213,0.937113,three
2013-01-05,-2.159187,0.15225,0.322503,-0.366304,four
2013-01-06,0.920606,0.712668,1.549561,0.618919,three


In [41]:
df2[df2["E"].isin(["two", "four"])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.903653,-0.025349,0.352164,1.779257,two
2013-01-05,-2.159187,0.15225,0.322503,-0.366304,four


# Setting

Setting a new column automatically aligns the data by the indexes.

In [42]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [43]:
df["F"] = s1
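
Alignment means the new column is matched on index labels, not on position. The sketch below, with a hypothetical three-row frame, shows that labels missing from the Series yield NaN and that extra labels in the Series are dropped.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0]}, index=pd.date_range("20130101", periods=3)
)
# Starts one day later than the frame, so it overlaps on two labels only.
shifted = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))

frame["F"] = shifted

# 2013-01-01 has no match (NaN); 2013-01-04 is not in the frame and is ignored.
print(frame["F"].tolist())  # [nan, 1.0, 2.0]
```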

Setting values by label:

In [44]:
df.at[dates[0], "A"] = 0

Setting values by position:

In [45]:
df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [46]:
df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations.

In [47]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.770191,5,
2013-01-02,-0.881073,-2.187187,0.538104,5,1.0
2013-01-03,0.903653,-0.025349,0.352164,5,2.0
2013-01-04,-2.195193,1.163928,-0.278213,5,3.0
2013-01-05,-2.159187,0.15225,0.322503,5,4.0
2013-01-06,0.920606,0.712668,1.549561,5,5.0


A where operation with setting.

In [48]:
df3 = df.copy()


In [49]:
df3[df3 > 0] = -df3

In [50]:
df3

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.770191,-5,
2013-01-02,-0.881073,-2.187187,-0.538104,-5,-1.0
2013-01-03,-0.903653,-0.025349,-0.352164,-5,-2.0
2013-01-04,-2.195193,-1.163928,-0.278213,-5,-3.0
2013-01-05,-2.159187,-0.15225,-0.322503,-5,-4.0
2013-01-06,-0.920606,-0.712668,-1.549561,-5,-5.0


# Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
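
As a small illustration of "not included in computations", the sketch below compares the default reduction with `skipna=False` on a made-up Series.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = values.mean(skipna=False)  # NaN propagates
n_present = values.count()               # counts non-missing values only

print(mean_default)  # 2.0
print(mean_strict)   # nan
print(n_present)     # 2
```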

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [51]:
df4 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])


In [52]:
df4.loc[dates[0] : dates[1], "E"] = 1

In [53]:
df4

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.770191,5,,1.0
2013-01-02,-0.881073,-2.187187,0.538104,5,1.0,1.0
2013-01-03,0.903653,-0.025349,0.352164,5,2.0,
2013-01-04,-2.195193,1.163928,-0.278213,5,3.0,


To drop any rows that have missing data.

In [54]:
df4.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2013-01-02,-0.881073,-2.187187,0.538104,5,1.0,1.0


Filling missing data.

In [55]:
df4.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.770191,5,5.0,1.0
2013-01-02,-0.881073,-2.187187,0.538104,5,1.0,1.0
2013-01-03,0.903653,-0.025349,0.352164,5,2.0,5.0
2013-01-04,-2.195193,1.163928,-0.278213,5,3.0,5.0


To get the boolean mask where values are nan.

In [56]:
pd.isnull(df4)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True


# Operations

## Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [57]:
df.mean()

A   -0.568532
B   -0.030615
C    0.285655
D    5.000000
F    3.000000
dtype: float64

Same operation on the other axis:

In [58]:
df.mean(1)

2013-01-01    1.057452
2013-01-02    0.693969
2013-01-03    1.646094
2013-01-04    1.338104
2013-01-05    1.463113
2013-01-06    2.636567
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas
automatically broadcasts along the specified dimension.

In [59]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

In [60]:
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [61]:
df.sub(s, axis='index')

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.096347,-1.025349,-0.647836,4.0,1.0
2013-01-04,-5.195193,-1.836072,-3.278213,2.0,0.0
2013-01-05,-7.159187,-4.84775,-4.677497,0.0,-1.0
2013-01-06,,,,,
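
The same alignment and broadcasting can be seen on a smaller case. This is a sketch with hypothetical names (`frame`, `offsets`): the Series is matched to rows by label, subtracted from every column, and any row without a usable value becomes NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=["x", "y", "z"]
)
# "y" is explicitly missing and "z" is absent altogether.
offsets = pd.Series([1.0, np.nan], index=["x", "y"])

result = frame.sub(offsets, axis="index")

print(result.loc["x"].tolist())        # [0.0, 9.0]
print(result.loc["y"].isna().all())    # True
print(result.loc["z"].isna().all())    # True
```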


## Apply

Applying functions to the data:

In [62]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.770191,5,
2013-01-02,-0.881073,-2.187187,-0.232087,10,1.0
2013-01-03,0.02258,-2.212535,0.120077,15,3.0
2013-01-04,-2.172613,-1.048607,-0.158136,20,6.0
2013-01-05,-4.331801,-0.896357,0.164367,25,10.0
2013-01-06,-3.411195,-0.183689,1.713928,30,15.0


In [63]:
df.apply(lambda x: x.max() - x.min())

A    3.115799
B    3.351115
C    2.319752
D    0.000000
F    4.000000
dtype: float64

Histogramming:

In [64]:
s = pd.Series(np.random.randint(0, 7, size=10))

In [65]:
s

0    5
1    6
2    0
3    0
4    6
5    5
6    6
7    2
8    1
9    0
dtype: int32

In [66]:
s.value_counts()

6    3
0    3
5    2
2    1
1    1
dtype: int64

## String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them).

In [67]:
s = pd.Series(['A', 'B', 'C', 'Aa145ba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0          a
1          b
2          c
3    aa145ba
4       baca
5        NaN
6       caba
7        dog
8        cat
dtype: object
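
The note about regular expressions matters as soon as a pattern contains a metacharacter. A minimal sketch with `Series.str.contains()`, using a made-up Series; `na=False` is passed so missing values count as "no match":

```python
import numpy as np
import pandas as pd

names = pd.Series(["a.b", "axb", np.nan])

# By default the pattern is a regular expression, so "." matches any character.
as_regex = names.str.contains("a.b", na=False).tolist()
# With regex=False the pattern is treated as a literal substring.
as_literal = names.str.contains("a.b", regex=False, na=False).tolist()

print(as_regex)    # [True, True, False]
print(as_literal)  # [True, False, False]
```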

## Merge

### Concat

pandas provides various facilities for easily combining Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating pandas objects together with **concat()**:

In [68]:
df = pd.DataFrame(np.random.randn(10, 4))

In [None]:
df

Unnamed: 0,0,1,2,3
0,0.978842,1.331871,-0.481503,1.553392
1,-0.271619,0.691589,-0.14436,-1.394343
2,2.783805,0.10444,-1.177722,-1.198737
3,0.582939,-0.989887,0.659813,-0.330212
4,-1.884458,-0.122492,0.810439,0.704778
5,0.296384,2.626819,-0.473024,0.392419
6,-1.098717,0.204286,0.269362,1.115737
7,0.828431,0.234322,-0.734549,1.185173
8,0.58779,-0.365592,0.481941,-2.23958
9,-0.010623,1.497592,-0.673841,0.435061


Break it into pieces:

In [None]:
pieces = [df[:3], df[3:7], df[7:]]

In [None]:
pieces

In [None]:
pd.concat(pieces)

### Join

SQL style merges:

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on="key")
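
Because both frames above repeat the key `foo`, the merge pairs every left row with every right row that shares the key. The sketch below reproduces that case and contrasts it with unique keys; `left_unique` and `right_unique` are hypothetical names added for the comparison.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Duplicate keys on both sides: 2 x 2 = 4 rows.
merged = pd.merge(left, right, on="key")
print(len(merged))  # 4

left_unique = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_unique = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# Unique keys: one output row per key.
merged_unique = pd.merge(left_unique, right_unique, on="key")
print(len(merged_unique))  # 2
```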

### Append

Append rows to a dataframe:

In [None]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

In [None]:
df

In [None]:
s = df.iloc[3]

In [None]:
s

In [None]:
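# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement.
# It returns a new DataFrame, so assign the result to see the appended row.
df = pd.concat([df, s.to_frame().T], ignore_index=True)
df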

## Grouping

By **“group by”** we are referring to a process involving one or more of the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure
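
The three steps above can be spelled out by hand and compared with what `groupby` does in one call. This is only an illustrative sketch on a small made-up frame, not how pandas implements grouping internally.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# Split: iterating over a groupby yields (key, sub-frame) pairs.
groups = {key: part for key, part in frame.groupby("A")}
# Apply: run the same function on each group independently.
sums = {key: part["C"].sum() for key, part in groups.items()}
# Combine: collect the per-group results into one Series.
manual = pd.Series(sums).sort_index()

# The same computation in a single groupby call.
builtin = frame.groupby("A")["C"].sum()

print(manual.tolist())   # [6.0, 4.0] for bar, foo
print(builtin.tolist())  # [6.0, 4.0]
```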


    
    

In [None]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

In [None]:
df

Grouping and then applying a function **sum** to the resulting groups:

In [None]:
df.groupby('A')[['C', 'D']].sum()

In [None]:
df.groupby('B')[['C', 'D']].sum()

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [None]:
df.groupby(["A", "B"]).sum()

In [None]:
df.groupby(["B", "A"]).sum()

## Reshaping

### Stack

In [None]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

In [None]:
tuples

In [None]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [None]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

In [None]:
df2 = df[:4]

In [None]:
df2

The **stack()** method “compresses” a level in the DataFrame’s columns.

In [None]:
stacked = df2.stack()

In [None]:
stacked

In [None]:
pd.DataFrame(stacked)

With a “stacked” DataFrame or Series (having a **MultiIndex** as the index), the inverse operation of
**stack()** is **unstack()**, which by default unstacks the last level:

In [None]:
stacked.unstack()

In [None]:
stacked.unstack(1)

In [None]:
stacked.unstack(0)
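
Since the cells above use random data, here is a small deterministic sketch of the same round trip. The frame `frame` and its integer values are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
frame = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]], index=index, columns=["A", "B"]
)

# stack() moves the column labels into a new innermost index level.
stacked = frame.stack()
print(stacked.index.nlevels)  # 3
print(len(stacked))           # 8

# unstack() with no argument moves that last level back to the columns.
restored = stacked.unstack()
print(restored.shape)         # (4, 2)

# unstack(0) moves the "first" level to the columns instead.
by_first = stacked.unstack(0)
print(list(by_first.columns))  # ['bar', 'baz']
```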

## Pivot Tables

In [None]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
 'B' : ['A', 'B', 'C'] * 4,
 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
 'D' : np.random.randn(12),
 'E' : np.random.randn(12)})

In [None]:
df

We can produce **pivot tables** from this data very easily:

In [None]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
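
With random values it is hard to see what the pivot table computes. The sketch below uses a small made-up frame and relies on the default aggregation of `pd.pivot_table`, which is the mean; combinations that never occur come out as NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "C": ["foo", "foo", "bar", "foo"],
        "D": [1.0, 3.0, 5.0, 7.0],
    }
)

table = pd.pivot_table(frame, values="D", index="A", columns="C")

print(table.loc["one", "foo"])  # 2.0, the mean of 1.0 and 3.0
print(table.loc["two", "bar"])  # 5.0
print(table.loc["one", "bar"])  # nan, no such rows in the input
```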

## Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

In [None]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

In [None]:
ts.resample('5Min').sum()
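
A resampling call only defines the bins; an aggregation such as `sum()` or `mean()` then reduces each bin to one value. A minimal sketch with fixed numbers (the series `minute_data` is hypothetical):

```python
import pandas as pd

rng = pd.date_range("2012-01-01", periods=6, freq="min")
minute_data = pd.Series([1, 2, 3, 4, 5, 6], index=rng)

# Group the one-minute samples into two-minute bins and add them up.
totals = minute_data.resample("2min").sum()

print(totals.tolist())  # [3, 7, 11]
```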

### Time zone representation

In [None]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

In [None]:
ts_utc = ts.tz_localize('UTC')

In [None]:
ts_utc

**Convert to another time zone**

In [None]:
ts_utc.tz_convert('US/Eastern')

**Converting between time span representations**

In [None]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

In [None]:
ps = ts.to_period()
ps

In [None]:
ps.to_timestamp()

Converting between period and timestamp enables some convenient arithmetic functions to be
used. In the following example, we convert a quarterly frequency with year ending in November to
9am of the end of the month following the quarter end:

In [None]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

## Categoricals

Since version 0.15, pandas can include categorical data in a **DataFrame**.

In [None]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a','a','e']})

Convert the raw grades to a categorical data type.

In [None]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

Rename the categories to more meaningful names (**Series.cat.rename_categories()** returns a new **Series**, so assign the result back):

In [None]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and simultaneously add the missing categories (methods under **Series.cat** return a new **Series** by default).

In [None]:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]

Sorting is per order in the categories, not lexical order:

In [None]:
df.sort_values(by="grade")
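
To make the difference between category order and lexical order concrete, here is a short sketch on a made-up Series of grades.

```python
import pandas as pd

grades = pd.Series(
    pd.Categorical(
        ["good", "very bad", "very good"],
        categories=["very bad", "good", "very good"],
    )
)

# sort_values() follows the order in which the categories were declared.
by_category = grades.sort_values().tolist()
# Plain string sorting is alphabetical.
lexical = sorted(grades.tolist())

print(by_category)  # ['very bad', 'good', 'very good']
print(lexical)      # ['good', 'very bad', 'very good']
```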

Grouping by a categorical column also shows empty categories:

In [None]:
df.groupby("grade", observed=False).size()

## Plotting

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.close("all")

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

In [None]:
ts = ts.cumsum()

In [None]:
ts.plot()

On a DataFrame, the **plot()** method is a convenience to plot all of the columns with labels:

In [None]:
df = pd.DataFrame(
         np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
     )

In [None]:
df = df.cumsum()

In [None]:
plt.figure()

In [None]:
df.plot()
plt.legend(loc='best')

## Getting data in/out

### CSV

Writing to a CSV file:

In [None]:
df.to_csv("10mpandas.csv")

Reading from a CSV file:

In [None]:
pd.read_csv("10mpandas.csv")
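
One detail of the round trip is worth knowing: `to_csv()` writes the index as an unnamed first column, so a plain `read_csv()` brings it back as a regular column called `Unnamed: 0`. The sketch below uses an in-memory buffer instead of a file and a small hypothetical frame; passing `index_col=0` restores the index.

```python
import io

import pandas as pd

frame = pd.DataFrame({"A": [1, 2]}, index=["x", "y"])

# With no path argument, to_csv() returns the CSV text as a string.
text = frame.to_csv()

plain = pd.read_csv(io.StringIO(text))
print(list(plain.columns))  # ['Unnamed: 0', 'A']

restored = pd.read_csv(io.StringIO(text), index_col=0)
print(list(restored.index))    # ['x', 'y']
print(list(restored.columns))  # ['A']
```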