# py_datatable wiki 

In [62]:
# Import the necessary libraries
import datatable as dt
# Note: the imported `sum` shadows Python's built-in sum in this notebook
from datatable import f, by, count, sum, update, sort
dt.init_styles()

## 1. Data Manipulations

### 1.1 How to sort a datatable frame in descending order?

We create a sample dataframe with two columns, product (string type) and totals (numeric type), using dt.Frame, and assign it to a variable called X.

In [63]:
X = dt.Frame(product=["apples", "spam", "goo", "bobcat", "gold"],
             totals=[5.4, 2.777, 0.1, 2.9, 11.1])

In [64]:
X

|   | product | totals |
|---|---------|--------|
| 0 | apples  | 5.4    |
| 1 | spam    | 2.777  |
| 2 | goo     | 0.1    |
| 3 | bobcat  | 2.9    |
| 4 | gold    | 11.1   |


As you may already know, the general datatable indexing syntax is:

    DT[I, J, BY|SORT|JOIN]

For now, look at the sort function: it takes either a single column or multiple columns, and it works on both string and numeric columns.


1. In case 1 below, we pass the column totals to sort, which arranges the dataframe in ascending order of that column.

2. In case 2, the same column is prefixed with a minus sign (-), which arranges the dataframe in descending order of totals.

3. In case 3, we arrange the dataframe in ascending order of the product column.

In [65]:
# case - 1
X[:,:,sort(f.totals)]

|   | product | totals |
|---|---------|--------|
| 0 | goo     | 0.1    |
| 1 | spam    | 2.777  |
| 2 | bobcat  | 2.9    |
| 3 | apples  | 5.4    |
| 4 | gold    | 11.1   |


In [66]:
# case - 2 
X[:,:,sort(-f.totals)]

|   | product | totals |
|---|---------|--------|
| 0 | gold    | 11.1   |
| 1 | apples  | 5.4    |
| 2 | bobcat  | 2.9    |
| 3 | spam    | 2.777  |
| 4 | goo     | 0.1    |


In [67]:
# case - 3
X[:,:,sort(f.product)]

|   | product | totals |
|---|---------|--------|
| 0 | apples  | 5.4    |
| 1 | bobcat  | 2.9    |
| 2 | gold    | 11.1   |
| 3 | goo     | 0.1    |
| 4 | spam    | 2.777  |
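
As a cross-check of the three cases above, the same orderings can be reproduced in plain Python without datatable. This is only a sketch of what each sort call computes, not a description of how datatable sorts internally; the `rows` list and the variable names are illustrative assumptions.

```python
# Rows of the sample frame as (product, totals) tuples
rows = [("apples", 5.4), ("spam", 2.777), ("goo", 0.1),
        ("bobcat", 2.9), ("gold", 11.1)]

# case 1: ascending by totals, like sort(f.totals)
asc_totals = sorted(rows, key=lambda r: r[1])

# case 2: descending by totals, like sort(-f.totals)
desc_totals = sorted(rows, key=lambda r: -r[1])

# case 3: ascending by product, like sort(f.product)
asc_product = sorted(rows, key=lambda r: r[0])

print([r[0] for r in asc_totals])
print([r[0] for r in desc_totals])
print([r[0] for r in asc_product])
```

Negating the sort key flips the order for numeric values, which is the idea behind the minus sign in case 2.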


Let us create one more dataframe, this time with repeated product values:

In [68]:
X = dt.Frame(products=['apples','spam','apples','gold','spam'],
             totals=[20,40,35,10,5])

We now sum the totals for each product category and arrange the result in descending order of the newly created column tot_sum.

In [69]:
# Sum totals per product, then sort the grouped result by tot_sum in descending order
X[:, {'tot_sum': sum(f.totals)}, by(f.products)][:, :, sort(-f.tot_sum)]

|   | products | tot_sum |
|---|----------|---------|
| 0 | apples   | 55      |
| 1 | spam     | 45      |
| 2 | gold     | 10      |
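
The two chained steps (group and aggregate, then sort) can be sketched in plain Python to make the logic explicit. This is an illustration under the assumption that the rows are held as simple tuples; it does not use datatable and the names are hypothetical.

```python
# Same data as the frame above, as (products, totals) pairs
rows = [("apples", 20), ("spam", 40), ("apples", 35), ("gold", 10), ("spam", 5)]

# Group step: accumulate totals per product, like by(f.products) with sum(f.totals)
tot_sum = {}
for product, total in rows:
    tot_sum[product] = tot_sum.get(product, 0) + total

# Sort step: order the groups by the aggregated value, largest first,
# like the second [:, :, sort(-f.tot_sum)] selection
result = sorted(tot_sum.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

The sort has to run on the aggregated values, which is why the datatable expression applies a second selection to the grouped frame.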


### 1.2 How to count the number of instances for each category using group by in pydatatable?

Here is the basic indexing syntax of a datatable frame:

    DT[I, J, BY|SORT|JOIN]

We create a sample dataframe with a single column, **languages**, and want to count how many students are interested in learning each language. This is done by grouping with by and applying an aggregation such as count, min, max, or mean.

To calculate the number of observations per group we use the **count** function; let us see how it works below.

In [70]:
prog_lang_dt = dt.Frame(languages= ['html', 'R', 'R', 'html', 'R', 'javascript',
                                    'R', 'javascript', 'html'])

In [71]:
prog_lang_dt

|   | languages  |
|---|------------|
| 0 | html       |
| 1 | R          |
| 2 | R          |
| 3 | html       |
| 4 | R          |
| 5 | javascript |
| 6 | R          |
| 7 | javascript |
| 8 | html       |


In [72]:
prog_lang_dt[:,count(),by(f.languages)]

|   | languages  | count |
|---|------------|-------|
| 0 | R          | 4     |
| 1 | html       | 3     |
| 2 | javascript | 2     |


If we would like to rename the count column to total, it can be done as follows:

In [73]:
prog_lang_dt[:,{'total':count()},by(f.languages)]

|   | languages  | total |
|---|------------|-------|
| 0 | R          | 4     |
| 1 | html       | 3     |
| 2 | javascript | 2     |
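
The grouped counts can be verified independently with Python's standard library. This is a sketch that assumes the column values are available as a plain list; it is a cross-check of the result, not a replacement for the datatable call.

```python
from collections import Counter

languages = ['html', 'R', 'R', 'html', 'R', 'javascript',
             'R', 'javascript', 'html']

# Count occurrences of each distinct value, like count() with by(f.languages)
totals = Counter(languages)

# Sort by key to mirror the order of the grouped output shown above
for language, total in sorted(totals.items()):
    print(language, total)
```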


**count** can also take a column as an argument, in which case it reports the number of non-missing entries in that column. For this example we create a new dataframe, Y.

In [74]:
data = """
       id|charges|payment_method
       634-VHG|28|Cheque
       365-DQC|33.5|Credit card
       264-PPR|631|--
       845-AJO|42.3|
       789-KPO|56.9|Bank Transfer
       """

In [75]:
# read the data
Y = dt.fread(data, na_strings=['--', ''])

In [76]:
Y[:,count(f.payment_method)]

|   | payment_method |
|---|----------------|
| 0 | 3              |


Here it simply shows that the count of payment methods is 3. The remaining 2 observations are missing values (the entries "--" and the empty string were read as NA) and are ignored.
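
To see why the answer is 3, the same text can be parsed by hand in plain Python. This is a minimal sketch, assuming the `|` separator and the same two missing-value markers that were passed to na_strings; the names `NA_STRINGS`, `records`, and `non_missing` are illustrative.

```python
data = """
       id|charges|payment_method
       634-VHG|28|Cheque
       365-DQC|33.5|Credit card
       264-PPR|631|--
       845-AJO|42.3|
       789-KPO|56.9|Bank Transfer
       """

NA_STRINGS = {"--", ""}  # same markers as passed to na_strings above

# Drop blank lines and surrounding whitespace, then split on the | separator
lines = [line.strip() for line in data.splitlines() if line.strip()]
header, *records = [line.split("|") for line in lines]

# Index of the payment_method column in the header
col = header.index("payment_method")

# Count only the entries that are not treated as missing
non_missing = len([rec for rec in records if rec[col] not in NA_STRINGS])
print(non_missing)
```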

### 1.3 How to type cast a dataframe column in pydatatable?

We will create a dataframe with four columns: cust_id, sales, profit_perc, and default.

In [77]:
sales_DT = dt.Frame(
    {"cust_id": [893232.43],
     "sales": [1234532],
     "profit_perc": ['10.43'],
     "default": [1]
    }
)

Check the data type of each column as below:

In [78]:
sales_DT.stypes

(stype.float64, stype.int32, stype.str32, stype.int8)

Here are some key points:

-  cust_id is of float type, but a customer id should generally be either an integer or a string

-  sales is of int type, but sales values are not always whole numbers, so the column may need to be a float

-  profit_perc is of string type, but it should be a float

**Note:** To convert a column from one data type to another, use the following syntax:

    DT['Column_name'] = new_data_type   (int, float, str, etc.)

First, we apply the above syntax to convert the column **cust_id** from float to integer.

In [79]:
sales_DT['cust_id'] = int

Let us verify whether it has become an integer type:

In [80]:
sales_DT.stypes

(stype.int32, stype.int32, stype.str32, stype.int8)

Yes, it is converted. Similarly, we can convert **sales** from int to float and check.

In [81]:
sales_DT['sales'] = float

The column default has int type; we can convert it to bool type.

In [82]:
sales_DT['default'] = bool

In [83]:
sales_DT.stypes

(stype.int32, stype.float64, stype.str32, stype.bool8)

So far we have seen the following conversions:

- float to int (cust_id)
- int to float (sales)
- int to bool (default)

Conversions from int or float to str use the same syntax, although they are not shown above.

In [84]:
sales_DT

|   | cust_id | sales     | profit_perc | default |
|---|---------|-----------|-------------|---------|
| 0 | 893232  | 1234530.0 | 10.43       | 1       |


So far so good. But have you tried converting a column from string to any other type **(int, float, bool)**?

In [90]:
sales_DT['profit_perc'] = float

NotImplementedError: Unable to cast `str32` into `float64`

**Note:** Conversions from string to other types are not implemented in datatable versions up to 0.10.1; they are expected to be added in later versions.
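
Where such a cast is unavailable, one possible workaround is to convert the string values in plain Python and rebuild the column from the converted list. The sketch below shows only the conversion step. The helper name `parse_floats` is hypothetical, and treating None and blank strings as missing is an assumption rather than datatable behaviour.

```python
def parse_floats(values):
    """Convert a list of numeric strings to floats, keeping missing entries as None."""
    converted = []
    for value in values:
        if value is None or value.strip() == "":
            converted.append(None)          # assumption: blank means missing
        else:
            converted.append(float(value))  # raises ValueError on non-numeric text
    return converted

# The single profit_perc value from the sample frame
profit_perc = parse_floats(["10.43"])
print(profit_perc)
```

The converted list can then be passed to dt.Frame, for example `dt.Frame(profit_perc=profit_perc)`, to obtain a float column.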