---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.1'
      jupytext_version: 1.1.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

To launch a live notebook server and test Optimus using Binder or Colab, click on one of the following badges:

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark).

Installation (pip):

In your terminal, run: pip install pyoptimus

Requirements

Python>=3.6

Examples

See the 10 minutes to Optimus notebook for the basics you need to start working.

You can also go to the examples folder to find specific notebooks about data cleaning, data munging, profiling, data enrichment and how to create ML and DL models.

Also check the Cheat Sheet.

Feedback

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

If you want to make a suggestion or a feature request, use https://github.com/hi-primus/optimus/issues

Start Optimus

Start Optimus using "pandas", "dask", "cudf" or "dask_cudf".

from optimus import Optimus
op = Optimus("pandas")
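
If you are not sure which engines are installed, a small helper can pick the first one that is available. This is only a sketch and not part of Optimus: pick_engine and ENGINE_MODULES are hypothetical names, and it assumes each engine string is backed by an importable module of the same name.

```python
from importlib.util import find_spec

# Hypothetical mapping from Optimus engine name to the Python module it needs.
# Assumption: each engine is backed by a module with the same name.
ENGINE_MODULES = {
    "dask_cudf": "dask_cudf",
    "cudf": "cudf",
    "dask": "dask",
    "pandas": "pandas",
}


def pick_engine(preferred=("dask_cudf", "cudf", "dask", "pandas"), is_installed=None):
    """Return the first engine in `preferred` whose module can be found, or None."""
    if is_installed is None:
        # find_spec returns None when a top-level module is not installed.
        is_installed = lambda module: find_spec(module) is not None
    for engine in preferred:
        if is_installed(ENGINE_MODULES[engine]):
            return engine
    return None


engine = pick_engine()
print(engine)
# With Optimus installed, the result would be passed on as: op = Optimus(engine)
```

The preference order here (GPU engines first, pandas last) is an arbitrary choice for illustration; reorder it to suit your environment.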

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats from a local file or a URL.

# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-3.0/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")
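
When the file type is only known at run time, the loader can be chosen from the file extension. The following is a rough sketch under stated assumptions: loader_name and LOADERS are hypothetical helpers, not Optimus functions, and only the loaders shown above (csv, json, parquet, file) are referenced.

```python
import os
from urllib.parse import urlparse

# Extensions mapped to the op.load methods shown above; anything else falls
# back to op.load.file, which the examples above use for an .xls file.
LOADERS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def loader_name(path_or_url):
    """Return the name of the op.load method to use for a local path or URL."""
    # For URLs, look only at the path component so query strings are ignored.
    path = urlparse(path_or_url).path
    extension = os.path.splitext(path)[1].lower()
    return LOADERS.get(extension, "file")


print(loader_name("../examples/data/foo.csv"))
print(loader_name("../examples/data/titanic3.xls"))
# With an Optimus instance: df = getattr(op.load, loader_name(path))(path)
```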

Also, you can load data from oracle, redshift, mysql and postgres.

Saving Data

# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")

You can also save data to oracle, redshift, mysql and postgres.
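
The file-based save calls above write into a data/ folder. Whether Optimus creates missing folders is not stated here, so a cautious approach is to create the folder first. The helper below, ensure_parent_dir, is a hypothetical name used only for this sketch.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it does not exist yet."""
    parent = os.path.dirname(path)
    if parent:  # a bare file name has no parent directory to create
        os.makedirs(parent, exist_ok=True)
    return path


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    target = ensure_parent_dir(os.path.join(root, "data", "foo.csv"))
    created = os.path.isdir(os.path.dirname(target))
print(created)

# With a dataframe from the examples above:
# df.save.csv(ensure_parent_dir("data/foo.csv"))
```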

Create dataframes

Also, you can create a dataframe from scratch

df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})

Using display, you can show your data along with extra information such as column numbers, column data types and marked white space.

display(df)

Cleaning and Processing

Optimus was created to make data cleaning a breeze. The API was designed to be easy for newcomers and familiar to people who come from Pandas. Optimus expands the standard DataFrame functionality by adding the .rows and .cols accessors.

For example, you can load data from a URL, transform it and apply some predefined cleaning functions:

new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.remove_accents("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])
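
To make the intent of a few steps in the chain concrete, here is a plain-Python approximation of what lower, remove_accents, remove_special_chars and trim do to a single value, together with the yyyy/MM/dd to dd-MM-yyyy date conversion. It is not Optimus's implementation: the function names reuse the method names for readability, and the definition of a "special character" in particular is an assumption.

```python
import re
import unicodedata
from datetime import datetime


def remove_accents(text):
    """Drop combining marks after Unicode decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_special_chars(text):
    """Keep only letters, digits, underscores and whitespace (an assumed definition)."""
    return re.sub(r"[^\w\s]", "", text)


def clean_name(text):
    """Apply lower, remove_accents, remove_special_chars and trim, in that order."""
    return remove_special_chars(remove_accents(text.lower())).strip()


def reformat_date(value, current="%Y/%m/%d", output="%d-%m-%Y"):
    """Re-express a date string; the defaults mirror yyyy/MM/dd -> dd-MM-yyyy."""
    return datetime.strptime(value, current).strftime(output)


print(clean_name("  Optím&us Prime! "))
print(reformat_date("1980/04/10"))
```

In the chain above the same operations are applied to whole columns at once, so per-value functions like these are useful mainly for checking what a cleaned value should look like.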

Troubleshooting

ImportError: failed to find libmagic.  Check your installation

Install libmagic https://anaconda.org/conda-forge/libmagic
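
Before reinstalling, it can help to check whether the libmagic shared library is discoverable at all. The sketch below uses only the standard library; libmagic_path is a hypothetical helper name. The package that raises this error may also search other locations, so treat the result as a hint rather than a definitive answer.

```python
from ctypes.util import find_library


def libmagic_path():
    """Return the name or path of the libmagic shared library, or None if not found."""
    # find_library searches the platform's standard library locations.
    return find_library("magic")


found = libmagic_path()
if found is None:
    print("libmagic not found; install it before importing Optimus")
else:
    print(f"libmagic found: {found}")
```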

Contributing to Optimus

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:

Documentation updates, enhancements, designs, or bugfixes.
Spelling or grammar fixes.
README.md corrections or redesigns.
Adding unit or functional tests.
Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
Searching #optimusdata on Twitter and helping someone else who needs help.
Blogging, speaking about, or creating tutorials about Optimus and its many features.
Helping others on Discord

Backers

Become a backer and get your image on our README on GitHub with a link to your site.

Core Team

Argenis Leon and Luis Aguirre

Contributors

Here are the amazing people who make Optimus possible:

https://github.com/hi-primus/optimus/graphs/contributors

License

Post-process readme script. Always run this if you modify the notebook.

This will recreate README.md

The script below processes the readme_.md that is output from this notebook: it removes the jupytext header and the Python helper lines (such as load_ext and sys.path.append), replaces the table and image calls with image links, and writes the result to README.md.

To use the table_image() function, install imgkit (pip install imgkit) and wkhtmltopdf (https://wkhtmltopdf.org/downloads.html). These are responsible for generating the Optimus tables as images.

from shutil import copyfile
output_file = "../README.md"
copyfile("readme_.md", output_file)

import sys
import fileinput
import re

pattern = r'"([A-Za-z0-9_\./\\-]*)"'

jupytext_header = False
flag_remove = False

remove = ["load_ext", "autoreload","import sys","sys.path.append"]

buffer = None
for i, line in enumerate(fileinput.input(output_file, inplace=1)):
    done= False
    try:
        # Remove some helper lines
        for r in remove:
            if re.search(r, line):
                done= True
        
        #Remove the post process code
        if re.search("Post-process", line):
            flag_remove = True
            
        if flag_remove is True:
            done = True        
            
        
        # Remove jupytext header
        if jupytext_header is True:
            done = True
            
        if  "---\n" == line: 
            jupytext_header = not jupytext_header      
                    
        elif done is False:
     
            # Replace .table_image(...) by table()
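            # Dots are escaped so that they match a literal "." in the call
            chars_table = re.search(r"\.table_image", line)
            chars_image = re.search(r"\.to_image", line)
            # A plot line contains both a .plot. accessor and an output_path= argument
            chars_plot = len(re.findall(r'(\.plot\.|output_path=)', line)) == 2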
            
            
            
            path = "readme/"
            if chars_table:
                print(line[0:int(chars_table.start())]+".table()")

                m = re.search(r'table_image\("(.*?)"\)', line).group(1)
                if m:
                    buffer = "![]("+ path + m + ")"              
            elif chars_image:
                m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
                if m:
                    buffer = "![]("+ path + m + ")"  
            elif chars_plot:

                m = re.search('output_path="(.*?)"', line).group(1)

                if m:
                    buffer = "![]("+ path + m + ")"  
            
            else:
                sys.stdout.write(line)
                
            if "```\n"==line and buffer:                
                print(buffer)
                buffer = None
                
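    except Exception as e:
        # stdout is redirected into the file during inplace editing, so report errors on stderr
        print(e, file=sys.stderr)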
        
fileinput.close()


# Remove empty python cells
flag = False
for i, line in enumerate(fileinput.input(output_file, inplace=1)):
   
    if re.search("```python", line):     
        flag = True
    elif re.search("```", line) and flag is True:
        flag=False
    elif flag is True:
        flag = False
        print("```python")
        print(line,end="")
    else:
        print(line, end="")
                    
        
fileinput.close()

# Quick check of the path-extraction pattern used above
line = 'op.profiler.to_image(output_path="images/profiler.png")'
m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
print(m)
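
The path-extraction step of the script above can also be exercised in isolation as a pure function, which makes it easier to test than the inplace loop. This is a sketch: image_markdown, IMAGE_PREFIX and _PATTERNS are hypothetical names, and images/table.png is a made-up path. Unlike the script, it returns None instead of raising when a line contains no image path.

```python
import re

IMAGE_PREFIX = "readme/"  # same prefix as the `path` variable in the script above

# One pattern per call style handled by the script above.
_PATTERNS = (
    re.compile(r'table_image\("(.*?)"\)'),
    re.compile(r'output_path="(.*?)"'),
)


def image_markdown(line, prefix=IMAGE_PREFIX):
    """Return a Markdown image link for the image path in `line`, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(line)
        if match and match.group(1):
            return "![](" + prefix + match.group(1) + ")"
    return None


print(image_markdown('op.profiler.to_image(output_path="images/profiler.png")'))
print(image_markdown('df.table_image("images/table.png")'))
print(image_markdown("df = op.load.csv('foo.csv')"))
```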
