Added more content to README
hosseinmoein committed Apr 29, 2023
1 parent be62050 commit 355c673
Showing 7 changed files with 301 additions and 11 deletions.
18 changes: 7 additions & 11 deletions README.md
@@ -32,10 +32,6 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/hosseinmoein/DataFrame/master)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/hosseinmoein/DataFrame/graphs/commit-activity)

<!--
[![HitCount](http://hits.dwyl.io/hosseinmoein/DataFrame.svg)](http://hits.dwyl.io/hosseinmoein/DataFrame)
-->

<img src="docs/LionLookingUp.jpg" alt="DataFrame Lion" width="400" longdesc="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html"/>

## [*DataFrame Documentation / Code Samples*](https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html)
@@ -47,12 +43,12 @@ For basic operations to start you off, see [Hello World](examples/hello_world.cc
I have followed a few <B>principles in this library</B>:<BR>

1. Support any type either built-in or user defined without needing new code
2. Never chase pointers ala `linked lists`, `std::any`, `pointer to base`, ..., including `virtual functions`
3. Have all column data in contiguous memory space. Also, be mindful of cache-line aliasing misses between multiple columns
4. Never use more space than you need ala `unions`, `std::variant`, ...
5. Avoid copying data as much as possible
6. Use multi-threading but only when it makes sense
7. Do not attempt to protect the user against `garbage in`, `garbage out`
2. [Never chase pointers ala _linked lists_, _std::any_, _pointer to base_, ...](docs/HTML/pointers.html)
3. [Have all column data in contiguous memory space](docs/HTML/contiguous_memory.html)
4. [Never use more space than you need ala _unions_, _std::variant_, ...](docs/HTML/std_variant.html)
5. [Avoid copying data as much as possible](docs/HTML/copying_data.html)
6. [Use multi-threading but only when it makes sense](docs/HTML/multithreading.html)
7. [Do not attempt to protect the user against _garbage in_, _garbage out_](docs/HTML/garbage_in_garbage_out.html)

[DateTime](docs/DateTimeDoc.pdf)<BR>
DateTime class included in this library is a very cool and handy object to manipulate date/time with nanosecond precision and multi timezone capability.<BR>
@@ -93,7 +89,7 @@ sys 0m25.983s
1. Pandas script, I believe, is entirely implemented in Numpy which is in C.
2. In case of Pandas, allocating memory + random number generation takes almost the same amount of time as calculating means.
3. In case of DataFrame ~90% of the time is spent in allocating memory + random number generation.
4. You load data once, but calculate statistics many times. So DataFrame, in general, is about ~11x faster than parts of Pandas that are implemented in Numpy. I leave parts of Pandas that are purely in Python to imagination.
4. You load data once, but calculate statistics many times. So DataFrame, in general, is about ~11x faster than parts of Pandas that are implemented in Numpy (i.e. C). I leave parts of Pandas that are purely in Python to imagination.
5. Pandas process image at its peak is ~105GB. C++ DataFrame process image at its peak is ~56GB.

---
46 changes: 46 additions & 0 deletions docs/HTML/contiguous_memory.html
@@ -0,0 +1,46 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

Watch this video: <a href="https://www.youtube.com/watch?v=WDIkqP4JbkE">Scott Meyers: CPU Caches and Why You Care</a>. Enough said.
<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
46 changes: 46 additions & 0 deletions docs/HTML/copying_data.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

Copying data obviously wastes both space and time, but it can also have other side effects. For example, constructing and/or destroying the copied objects may itself have side effects. Also, some objects consume expensive resources, and copying them may increase resource consumption.
<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>
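
The point made in copying_data.html can be illustrated with a minimal C++ sketch. The `Column` type, the `g_copies` counter, and both functions are hypothetical names invented for this illustration; they are not part of the DataFrame API. The sketch only shows that passing by value deep-copies an lvalue, while passing by const reference or moving does not.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical counter, used only to make deep copies visible in this sketch.
int g_copies = 0;

// Hypothetical column-like type that counts how often it is deep-copied.
struct Column {
    std::vector<double> data;

    explicit Column(std::vector<double> d) : data(std::move(d)) {}
    Column(const Column &other) : data(other.data) { ++g_copies; }
    Column(Column &&) noexcept = default;
};

// Takes its argument by value: an lvalue argument is deep-copied,
// an rvalue argument is moved instead.
double first_by_value(Column col) {
    assert(!col.data.empty());
    return col.data.front();
}

// Takes a const reference: no copy is made at all.
double first_by_ref(const Column &col) {
    assert(!col.data.empty());
    return col.data.front();
}
```

Calling `first_by_ref(col)` leaves `g_copies` unchanged, `first_by_value(col)` increments it once, and `first_by_value(std::move(col))` transfers the buffer without a deep copy.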

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
53 changes: 53 additions & 0 deletions docs/HTML/garbage_in_garbage_out.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

<I>Garbage In, Garbage Out</I> is when a user inputs a dataset that is not suitable for a particular analysis and gets nonsense as a result. For example, a data analysis algorithm may require the input to be normally distributed. If the user inputs a dataset that is far from normally distributed, the result will be garbage.<BR>
One approach would be for the library to check the data first and warn the user that the data is not suitable. DataFrame doesn’t do that because this approach has many pitfalls:
<OL>
<LI>It makes the code inefficient and slow</LI>
<LI>It makes the code convoluted and hard to understand and maintain</LI>
<LI>The check is often more complicated than the algorithm itself, which makes the code bug-prone</LI>
</OL>

<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
48 changes: 48 additions & 0 deletions docs/HTML/multithreading.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

Multithreading is used for two main reasons: one is to increase throughput and performance, and the other is to make the code more succinct and structured.<BR>
Using multithreading to increase throughput and performance is very tricky and often counterproductive. Unfortunately, it is very hard to have a <I>generic</I> multithreading solution for all problems. It depends heavily on the nature of the problem and the hardware/software platform. That is why the multithreading solution in DataFrame is very tunable and requires careful user adjustments. I suggest always starting with a single thread and experimenting with multithreading later, once the system is working correctly.<BR>
Also, watch this video: <a href="https://www.youtube.com/watch?v=WDIkqP4JbkE">Scott Meyers: CPU Caches and Why You Care</a>
<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>
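
The advice in multithreading.html (start single-threaded, then tune) can be sketched in plain C++ as follows. This is a generic illustration built on `std::thread`; the function name `parallel_sum` and its `num_threads` parameter are assumptions made for this example and do not represent DataFrame's own threading interface.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` worker threads.
// With num_threads <= 1, or too little data to split, it runs sequentially.
double parallel_sum(const std::vector<double> &data, std::size_t num_threads) {
    if (num_threads <= 1 || data.size() < num_threads)
        return std::accumulate(data.begin(), data.end(), 0.0);

    const std::size_t chunk = data.size() / num_threads;
    assert(chunk > 0);

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        const auto first = static_cast<std::ptrdiff_t>(t * chunk);
        // The last worker also takes the remainder of the division.
        const auto last = (t + 1 == num_threads)
            ? static_cast<std::ptrdiff_t>(data.size())
            : first + static_cast<std::ptrdiff_t>(chunk);

        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&data, &partial, t, first, last] {
            partial[t] = std::accumulate(data.begin() + first,
                                         data.begin() + last, 0.0);
        });
    }
    for (auto &w : workers)
        w.join();

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Starting with `parallel_sum(data, 1)` gives a correct single-threaded baseline. Whether a larger `num_threads` helps depends on the data size and the machine, so it is worth measuring rather than assuming.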

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
51 changes: 51 additions & 0 deletions docs/HTML/pointers.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

Containers of pointers to data are very inefficient for two main reasons:<BR>
<OL>
<LI>Each pointer points to a "different" memory location. In other words, the data is not stored contiguously. That breaks cache locality and therefore it is a major inefficiency.</LI>
<LI>To access data through a pointer you must, in general, do two things: first load the pointer itself and then load the data it points to. For large datasets this extra indirection becomes an inefficiency.</LI>
</OL>

<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>
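
The two reasons listed in pointers.html can be made concrete with a small C++ sketch that contrasts a contiguous column with a column of pointers. All function names here are hypothetical and chosen for illustration only. The sketch shows the difference in layout and access pattern; it does not measure performance.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Contiguous column: all values sit next to each other in one allocation.
double sum_contiguous(const std::vector<double> &col) {
    return std::accumulate(col.begin(), col.end(), 0.0);
}

// Column of pointers: every element is a separate heap allocation, so each
// access first loads the pointer and then loads the value it points to.
double sum_indirect(const std::vector<std::unique_ptr<double>> &col) {
    double total = 0.0;
    for (const auto &p : col) {
        assert(p != nullptr);
        total += *p;
    }
    return total;
}

// Hypothetical helpers that build both layouts holding 0, 1, ..., n-1.
std::vector<double> make_contiguous(std::size_t n) {
    std::vector<double> col(n);
    std::iota(col.begin(), col.end(), 0.0);
    return col;
}

std::vector<std::unique_ptr<double>> make_indirect(std::size_t n) {
    std::vector<std::unique_ptr<double>> col;
    col.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        col.push_back(std::make_unique<double>(static_cast<double>(i)));
    return col;
}
```

Both functions compute the same sum, but only the contiguous version walks memory in address order, which is the access pattern caches and hardware prefetchers handle best.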

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
50 changes: 50 additions & 0 deletions docs/HTML/std_variant.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
<!--
Copyright (c) 2019-2026, Hossein Moein
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Hossein Moein and/or the DataFrame nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Hossein Moein BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<!DOCTYPE html>
<html>
<body>

For small to small-medium sized datasets, std::variant (a type-safe union) is a neat trick. For such datasets, combining std::variant with std::visit can be a good substitute for runtime polymorphism. But for large datasets std::variant has two main problems:<BR>
<OL>
<LI>Compilers/users must insert extra code, in conjunction with std::visit, to figure out the type of each item inside a std::variant and what to do with it</LI>
<LI>The size of a std::variant is at least the size of its largest alternative, plus room for a discriminator. So, you could be wasting significant memory space</LI>
</OL>
<BR>
<BR>
<BR>

<img src="https://github.com/hosseinmoein/DataFrame/blob/master/docs/LionLookingUp.jpg?raw=true" alt="C++ DataFrame"
width="200" height="150" style="float:right"/>

</body>
</html>
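
Both problems described in std_variant.html can be seen in a short C++17 sketch. The `Cell` alias and the function names are hypothetical and exist only for this illustration. The exact value of `sizeof(Cell)` is implementation-dependent, so the sketch asserts only the lower bounds that always hold.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical cell type that can hold any of three column types.
using Cell = std::variant<char, double, std::string>;

// Every Cell reserves room for its largest alternative, even when it
// only holds a single char.
static_assert(sizeof(Cell) >= sizeof(std::string));
static_assert(sizeof(Cell) > sizeof(char));

// Memory needed for n chars stored as variants vs. as a typed column.
std::size_t bytes_as_variants(std::size_t n) { return n * sizeof(Cell); }
std::size_t bytes_as_typed_column(std::size_t n) { return n * sizeof(char); }

// Summing only the doubles requires a type dispatch for every element.
double sum_numeric(const std::vector<Cell> &cells) {
    double total = 0.0;
    for (const auto &cell : cells) {
        assert(!cell.valueless_by_exception());
        total += std::visit(
            [](const auto &value) -> double {
                using T = std::decay_t<decltype(value)>;
                if constexpr (std::is_same_v<T, double>)
                    return value;
                else
                    return 0.0;  // chars and strings do not contribute
            },
            cell);
    }
    return total;
}
```

A column of one million `char` values stored as `Cell` needs at least `sizeof(std::string)` bytes per element instead of one, and every element access goes through the dispatch inside `std::visit`.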

<!--
Local Variables:
mode:HTML
tab-width:4
c-basic-offset:4
End:
-->
